> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing seems misplaced.
For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.
> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really be "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for the prompts; who knows whether files generated with a better prompt would show an improvement.
The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That knowledge is gained slowly, over time, from watching agents struggle because of the gap. It is exactly the kind of thing that is very common in closed-source code, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have very mixed AGENTS.md quality to begin with, then for bigger projects with high-quality .md files they're invaluable when working with agents.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
- How to build.
- How to run tests.
- How to work around the incredible crappiness of the codex-rs sandbox.
I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.
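As a rough sketch of what such a file can look like (the build/test commands and the sandbox note below are placeholders, not taken from any real repo):

```markdown
# AGENTS.md

## Build
- `make build` (placeholder; use the repo's real build entry point)

## Tests
- `make test`; run the tests that touch your change before opening a PR.

## Sandbox
- The sandbox blocks network access: never fetch dependencies at build or test time; everything needed is already vendored.

## Style
- The minimum Python version is 3.12.
- If you find yourself writing TypeVar, think again.
```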
I then asked if there was anything I could do to prevent misinterpretations from producing wild results like this, and got the advice to put an instruction in AGENTS.md urging agents to ask for clarification before proceeding. But I didn't add it: out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what would be.
If there’s a nugget of knowledge learned at any point in this conversation (not limited to the most recent exchange), please tersely update AGENTS.md so future agents can access it. If nothing durable was learned, no changes are needed. Do not add memories just to add memories.
Update AGENTS.md **only** if you learned a durable, generalizable lesson about how to work in this repo (e.g., a principle, process, debugging heuristic, or coding convention). Do **not** add bug- or component-specific notes (for example, “set .foo color in bar.css”) unless they reflect a broader rule.
If the lesson cannot be stated without referencing a specific selector or file, skip the memory and make no changes. Keep it to **one short bullet** under an appropriate existing section, or add a new short section only if absolutely necessary.
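As an illustration of the kind of single bullet that passes this bar (the section heading and wording are hypothetical), a durable, generalizable lesson rather than a component-specific note:

```markdown
## Debugging heuristics
- When a change has no visible effect, look for a later rule or config overriding it before editing more code.
```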
It hardly creates rules, but when it does, the rules positively affect behavior. This works very well.

Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.
Maybe I’m wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices.
Also, I bet the quality of these docs varies widely across both human- and AI-generated ones. Good Agents.md files should use progressive disclosure so that only the items required by the task are pulled in (e.g. for DB schema related topics, see such and such a file).
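A hedged sketch of what that progressive disclosure could look like (the file paths are invented for illustration):

```markdown
## Topic index (read only what the task needs)
- DB schema and migrations: see `docs/db-schema.md`
- Release process: see `docs/releasing.md`
- Frontend conventions: see `docs/frontend-style.md`
```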
Then there's the choice of pulling things into Agents.md vs skills which the article doesn't explore.
I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.
Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.
Sometimes I actually start my sessions like this: "please read the contributing.md file to understand how to build/test this project before making any code changes".
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that inputting an LLM-generated summary of the codebase didn't help the agent. Which isn't that surprising.
each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.
AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.
we wrote about it here: https://agnt.one/blog/black-hole-architecture
agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership
AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
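a rough sketch of what "constraints, not a roadmap" can look like in that AGENTS.md (the role names and globs are made up; only `.sys/plans/{role}/` comes from the setup above):

```markdown
## ownership (constraints, not a roadmap)
- role `api`: owns `server/**`, never touches `web/**`
- role `web`: owns `web/**`, never touches `server/**`
- role `infra`: owns `deploy/**` and CI config
- every role writes its plan to `.sys/plans/{role}/` as a contract before editing anything it owns
```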
Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.
What wasn't measured, probably because it's almost impossible to quantify, was the quality of the code produced. Did the context files help the LLMs produce code that matched the style of the rest of the project? Did the code produced end up reasonably maintainable in the long run, or was it slop that increased long-term tech debt? These are important questions, but as they are extremely difficult to assign numbers to and measure in an automated way, the paper didn't attempt to answer them.
I added these to that file because otherwise I will have to tell claude these things myself, repeatedly. But the science says... Respectfully, blow it out your ass.
Even with the latest and greatest (because I know people will reflexively jump down my throat if I don't specify that, yes, I've used Opus 4.6 and Gemini 3 Pro etc. etc.; I have access to all of the models by way of work and use them regularly), my experience has been that it's basically a crapshoot whether it'll listen to a single one of these files, especially in the long run with large chats. The number of times I still have to tell these things not to generate React in my Vue codebase, which has literally not a single line of JSX anywhere and instructions in every single possible file I can put them in to NOT GENERATE FUCKING REACT CODE, makes me want to blow my brains out every time it happens. In fact it happened to me today, in a fresh chat, with the supposed superintelligence known as Opus 4.6 that has 18 trillion TB of context or whatever, when I asked for a quick snippet I needed to experiment with.
I'm not even paying for this crap (work is) and I still feel scammed approximately half the time, and can't help but think all of these suggestions are just ways to inflate token usage and to move you into the usage limit territory faster.
The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.
EDIT: It's amusing how sensitive the blue-pilled crowd is when confronted with reality. :)