> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing seems misplaced.
For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.
> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really be "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for the prompts; who knows whether files generated with a better prompt would show an improvement.
The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That knowledge is gained slowly, over time, from watching agents struggle because of the gap. It is exactly the kind of thing that is very common in closed-source code, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have very mixed AGENTS.md quality to begin with, then for bigger projects with high-quality .md files they're invaluable when working with agents.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
- How to build.
- How to run tests.
- How to work around the incredible crappiness of the codex-rs sandbox.
I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.
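As a rough sketch of what such a file can look like (the build/test commands and the sandbox note below are placeholders, not taken from any real repo):

```markdown
# AGENTS.md

## Build
- `make build` (placeholder; use the repo's real build entry point)

## Tests
- `make test`; run the tests that touch your change before opening a PR.

## Sandbox
- The sandbox blocks network access: never fetch dependencies at build or test time; everything needed is already vendored.

## Style
- The minimum Python version is 3.12.
- If you find yourself writing TypeVar, think again.
```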
I then asked if there was anything I could do to prevent misinterpretations from producing wild results like this, and got the advice to put an instruction in AGENTS.md urging agents to ask for clarification before proceeding. But I didn't add it: out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what would be.
If there’s a nugget of knowledge learned at any point in this conversation (not limited to the most recent exchange), please tersely update AGENTS.md so future agents can access it. If nothing durable was learned, no changes are needed. Do not add memories just to add memories.
Update AGENTS.md **only** if you learned a durable, generalizable lesson about how to work in this repo (e.g., a principle, process, debugging heuristic, or coding convention). Do **not** add bug- or component-specific notes (for example, “set .foo color in bar.css”) unless they reflect a broader rule.
If the lesson cannot be stated without referencing a specific selector or file, skip the memory and make no changes. Keep it to **one short bullet** under an appropriate existing section, or add a new short section only if absolutely necessary.
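As an illustration of the kind of single bullet that passes this bar (the section heading and wording are hypothetical), a durable, generalizable lesson rather than a component-specific note:

```markdown
## Debugging heuristics
- When a change has no visible effect, look for a later rule or config overriding it before editing more code.
```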
It hardly creates rules, but when it does, the rules positively affect behavior. This works very well.

Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.
Maybe I’m wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices.
Also, I bet the quality of these docs varies widely across both human- and AI-generated ones. Good Agents.md files should use progressive disclosure so that only the items required by the task are pulled in (e.g. for DB schema related topics, see such and such a file).
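A hedged sketch of what that progressive disclosure could look like (the file paths are invented for illustration):

```markdown
## Topic index (read only what the task needs)
- DB schema and migrations: see `docs/db-schema.md`
- Release process: see `docs/releasing.md`
- Frontend conventions: see `docs/frontend-style.md`
```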
Then there's the choice of pulling things into Agents.md vs skills which the article doesn't explore.
I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.
Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.
Sometimes I actually start my sessions like this: "please read the contributing.md file to understand how to build/test this project before making any code changes".
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that inputting an LLM-generated summary of the codebase didn't help the agent. Which isn't that surprising.
each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.
AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.
we wrote about it here: https://agnt.one/blog/black-hole-architecture
agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership
AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
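a rough sketch of what "constraints, not a roadmap" can look like in that AGENTS.md (the role names and globs are made up; only `.sys/plans/{role}/` comes from the setup above):

```markdown
## ownership (constraints, not a roadmap)
- role `api`: owns `server/**`, never touches `web/**`
- role `web`: owns `web/**`, never touches `server/**`
- role `infra`: owns `deploy/**` and CI config
- every role writes its plan to `.sys/plans/{role}/` as a contract before editing anything it owns
```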
Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.
What wasn't measured, probably because it's almost impossible to quantify, was the quality of the code produced. Did the context files help the LLMs produce code that matched the style of the rest of the project? Did the code produced end up reasonably maintainable in the long run, or was it slop that increased long-term tech debt? These are important questions, but as they are extremely difficult to assign numbers to and measure in an automated way, the paper didn't attempt to answer them.
I added these to that file because otherwise I will have to tell claude these things myself, repeatedly. But the science says... Respectfully, blow it out your ass.
Even with the latest and greatest (because I know people will reflexively jump down my throat if I don't specify that, yes, I've used Opus 4.6 and Gemini 3 Pro etc. etc.; I have access to all of the models by way of work and use them regularly), my experience has been that it's basically a crapshoot whether it'll listen to a single one of these files, especially in the long run with large chats. The number of times I still have to tell these things not to generate React in my Vue codebase, which has literally not a single line of JSX anywhere and instructions in every single possible file I can put them in to NOT GENERATE FUCKING REACT CODE, makes me want to blow my brains out every time it happens. In fact it happened to me today, in a fresh chat, with the supposed superintelligence known as Opus 4.6 that has 18 trillion TB of context or whatever, when I asked for a quick snippet I needed to experiment with.
I'm not even paying for this crap (work is) and I still feel scammed approximately half the time, and can't help but think all of these suggestions are just ways to inflate token usage and to move you into the usage limit territory faster.
The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.
EDIT: It's amusing how sensitive the blue-pilled crowd is when confronted with reality. :)