- Models are not AGI. They are text generators forced to generate text in a way that usefully triggers a harness, which then produces effects like editing files or calling tools.
So the model doesn't "understand" that you have a skill and decide to use it. The behavior of generating text that triggers skill usage is trained via Reinforcement Learning on human-generated examples and usage traces.
So why doesn't the model use skills all the time? Because skills are a new thing; there aren't enough training samples displaying that behavior.
They also cannot enforce it via RL, because skills use human language, which is ambiguous and not formal. Force the model to always use skills via the RL policy and you'll make it dumber.
So, right now, we are generating usage traces that will be used to train future models to get a better grasp of when to use skills and when not to. Just give it time.
AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.
- > In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.
The agent passes the Turing test...
- The key finding is that "compression" of doc pointers works.
It's barely readable to humans, but directly and efficiently relevant to LLMs (direct reference -> referent, without language verbiage).
This suggests some (compressed) index format that is always loaded into context will replace the heuristics around agents.md/claude.md/skills.md (rough sketch below).
So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).
Possibly also a side issue: APIs could repurpose their test suites as validation to compare LLM performance on code tasks.
LLMs create huge adoption waves. Libraries/APIs will have to learn to surf them or be limited to usage by humans.
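To make "compressed index" concrete, here is a purely hypothetical sketch of what such an always-loaded index might look like (invented file names and topics, not the article's actual format):

    # docs index (always in context)
    routing|app router, layouts, dynamic segments|.docs/next/routing.md
    caching|fetch cache, revalidate, ISR|.docs/next/caching.md
    server-actions|mutations, forms, "use server"|.docs/next/server-actions.md

Each line is just term|keywords|path: a direct reference the model can follow, with no prose around it.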
- Am I missing something here?
Obviously, directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed them to the agent (in a system prompt, or similar), and it will follow the instructions more reliably.
However, at a certain point you have to use skills, because including everything in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight up include everything.
It's all a context/price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, by compressing it into an AGENTS.md).
- The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.
I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
- Something I always wonder with each blog post comparing different types of prompt engineering is whether they ran it once or multiple times. LLMs are not consistent for the same task. I imagine they realize this, of course, but I never get enough details of the testing methodology.
- This is confusing.
TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.
The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".
Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
- I'm not sure if this is widely known but you can do a lot better even than AGENTS.md.
Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.
This ensures the LLM has all the information it needs right in context from the get-go. Much better performance, cheaper, and fewer mistakes.
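If you want to automate populating the folder, a minimal Node/TypeScript sketch (the file list is just an example, adjust to your project):

    // populate-context.ts - symlink selected docs into .context/
    import { mkdirSync, symlinkSync, existsSync } from "node:fs";
    import { join, resolve } from "node:path";

    const docs = [
      "node_modules/next/README.md",
      "docs/architecture.md",
    ];

    mkdirSync(".context", { recursive: true });
    for (const src of docs) {
      // flatten the path into a unique file name inside .context/
      const dest = join(".context", src.replace(/[\\/]/g, "__"));
      if (existsSync(src) && !existsSync(dest)) {
        symlinkSync(resolve(src), dest); // absolute target so the link survives cwd changes
      }
    }

Then point your tool at .context the same way it already auto-reads AGENTS.md.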
- This largely mirrors my experience building my custom agent
1. Start from the Claude Code extracted instructions; they have many things like this in there. Their knowledge sharing in docs and blog posts on this aspect is bar none
2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically
3. Have topical markdown files / skills
4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me
5. Iterate, experiment, do weird things, and have fun!
I changed read/write_file to put contents in the state and present them in the system prompt, same for the agents.md. Now working on evals to show how much better this is, because anecdotally, it kicks ass
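For the read/write_file change, my reading of the idea (my own sketch, not their code) is roughly: instead of returning file contents as a tool result that can later be compacted away, pin them in agent state and re-render them into the system prompt every turn.

    import { readFileSync } from "node:fs";

    const pinnedFiles: Record<string, string> = {};

    function readFileTool(path: string): string {
      pinnedFiles[path] = readFileSync(path, "utf8"); // pinned: survives context compaction
      return `Loaded ${path} into working state.`;    // keep the tool result itself tiny
    }

    function buildSystemPrompt(base: string): string {
      const files = Object.entries(pinnedFiles)
        .map(([p, c]) => `### ${p}\n${c}`)
        .join("\n\n");
      return `${base}\n\n## Working files\n${files}`;
    }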
- Wouldn't this have been more readable with a \n newline instead of a pipe character as a separator? It wouldn't have made the prompt longer.
- The PreSession hook from obra/superpowers injects this, along with more logic to keep the model from rationalizing its way out of using skills:
> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill.
IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.
While this may result in overzealous activation of skills, I've found that if I have a related skill, I _want_ it to be used. It has worked well for me.
- 2 months later: "Anthropic introduces 'Claude Instincts'"
- In a month or three we’ll have the sensible approach, which is smaller cheaper fast models optimized for looking at a query and identifying which skills / context to provide in full to the main model.
It’s really silly to waste big model tokens on throat clearing steps
- What if instead of needing to run a codemod to cache per-lib docs locally, documentation could be distributed alongside a given lib, as a dev dependency, version locked, and accessible locally as plaintext. All docs can be linked in node_modules/.docs (like binaries are in .bin). It would be a sort of collection of manuals.
What a wonderful world that would be.
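Sketching what that layout might look like (hypothetical; no such convention exists today):

    node_modules/
      .bin/                 # executables, as today
      .docs/                # version-locked plaintext manuals, one folder per package
        next/
          routing.md
          caching.md
        zod/
          README.md

An agent's index could then point at .docs/<package>/<topic>.md without ever touching the network, and the docs would always match the installed version.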
- Firstly this is great work from Vercel - I am especially impressed with the evals setup (evals are the most undervalued component in any project IMO). Secondly the result is not surprising and I’ve seen consistently the increase in performance when you always include an index (or in my case, Table of Contents as a json structure) in your system prompt. Applying this outside of coding agents (like classic document retrieval) also works very well!
- Measuring in terms of KB is not quite as useful as it seems here IMO - this should be measured in terms of context tokens used.
I ran their tool with an otherwise empty CLAUDE.md, and ran `claude /context`, which showed 3.1k tokens used by this approach (1.6% of the opus context window, bit more than the default system prompt. 8.3% is system tools).
Otherwise it's an interesting finding. The nudge seems like the real winner here, but potential further lines of inquiry that would be really illuminating:
1. How do these approaches scale with model size?
2. How are they impacted by multiple such clauses/blocks? Ie maybe 10 `IMPORTANT` rules dilute their efficacy
3. Can we get best of both worlds with specialist agents / how effective are hierarchical routing approaches really? (idk if it'd make sense for vercel specifically to focus on this though)
- I'm a bit confused by their claims. Or maybe I'm misunderstanding how Skills should work. But from what I know (and the small experience I had with them), skills are meant to be specifications for niche and well-defined areas of work (e.g. building the project, running custom pipelines, etc.)
If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...
- Prompted and built a bit of an extension of skills.sh with https://passivecontext.dev. It basically just takes the skill and creates that "compressed" index. You still have to install the skill and all that, but it might give others a bit of a shortcut to experiment with.
- Over the last week I took a deeper dive into using agent mode at work, and my experiments align with this observation.
The first thing that surprised me is how much the default tuning leans toward flattery: the user is always absolutely right, and whatever was done supposedly solves everything expected. But actually no: not a single real check was done, a ton of code was produced but the goal is not achieved at all, and of course many regressions now lurk in the code base, when it isn't outright breaking everything (which is at least less insidious).
What does surprise me is that it can easily churn out thousands of lines of tests, and can then be forced to loop over those tests until it succeeds. In my experiments it still produces far too much noise code, but at least the burden of checking whether any of it makes sense is drastically reduced.
- I did a similar set of evals myself utilising the baseline capabilities that Phoenix (elixir) ships with and then skillified them.
Regularly, the skills were not being loaded and thus not utilised. The outputs themselves were fine. This suggested that, at some stage through the improvements of the models, the baseline AGENTS.md had become redundant.
- Compressing information in AGENTS.md makes a ton of sense, but why are they measuring their context in bytes and not tokens!?
- I don't think you can really learn from this experiment unless you specify which models you used, if you tried it against at least 3 frontier models, if you ran each eval multiple times, and what prompts you tried.
These things are non-deterministic across multiple axes.
- This doesn't surprise me.
I have a SKILL.md for marimo notebooks with instructions in the frontmatter to always read it before working with marimo files. But half the time Claude Code still doesn't invoke it even with me mentioning marimo in the first conversation turn.
I've resorted to typing "read marimo skill" manually and that works fine. Technically you can use skills with slash commands but that automatically sends off the message too which just wastes time.
But the actual concept of instructions to load in certain scenarios is very good and has been worth the time to write up the skill.
- Blackbox oracles make bad workflows, and tend to produce a whole lot of cargo culting. It's this kind of opacity (why does the markdown outperform skills? there's no real way to find out, even with a fully open or in-house model, because the nature of the beast is that the execution path in a model can't be predicted) that makes me shy away from saying LLMs are "just another tool". If I can't see inside it -- and if even the vendor can't really see inside of it -- there's something fundamentally different.
- Isn't it obvious that an agent will do better if it internalizes the knowledge of something instead of having the option to request it?
Skills are new. Models haven't been trained on them yet. Give it 2 months.
- The problem is that Agents.md is only read on initial load. Once context grows too large the agent will not reload the md file and loses / forgets the info from Agents.md.
- Are people running into mismatched code vs project a lot? I've worked on python and java codebases with claude code and have yet to run into a version mismatch issue. I think maybe once it got confused on the api available in python, but it fixed it by itself. From other blog posts similar to this it would seem to be a widespread problem, but I have yet to see it as a big problem as part of my day job or personal projects.
- I will have to look into this this weekend. Antigravity is my current favorite agentic IDE and I have been having problems getting it to explicitly follow my agent.md settings.
If I remind it, it will go, "oh yes, ok, sure," and then do it, but the whole point is that I want to optimize my time with the agent.
- Sounds like they've been using skills incorrectly if they're finding their agents don't invoke the skills. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them and that your tasks / goals clearly set out requirements that align with the available skills.
- I'm working on stuff in a similar space.
I need to evaluate how different project scaffolding impacts the results of Claude Code/Opencode (either with Anthropic models or third-party ones) for agentic purposes.
But I am unsure how I should be testing, and it's not very clear how Vercel proceeded here.
you are telling me that a markdown file saying:
*You are the Super Duper Database Master Administrator of the Galaxy*
does not improve the model's ability to reason about databases?
- Don't want to derail the topic, but couldn't skills just be sub-agents in this contextualization?
There is a lot of language floating around what are effectively groups of text files put together in different configurations, or selected reliably.
- > When it needs specific information, it reads the relevant file from the .next-docs/ directory.
I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?
- Next.js sure makes a good benchmark for AI capability (and for clarity... this is not a compliment).
- I’m working on an AGI model that will make the discussion of skills look silly. Skills strikes in the right direction in some sense but it’s an extremely weak 1% echo of what’s actually needed to solve this problem.
- It's very interesting, but presenting success rates without any measure of the error, or at least inline details about the number of iterations, is unprofessional. Especially for small differences, or when you find the "same" performance.
- When we were trying to build our own agents, we put quite a bit of effort into evals, which was useful.
But switching over to using coding agents, we never did the same. Feels like building an eval set will be an important part of what engineering orgs do going forward.
- You will see another 14% bump in performance if you include, in the first 16 lines of the project's README.md, "Coding agents and LLMs: see AGENTS.md"
- This does not normalize for tokens used. If their skill description were as large as the docs index and contained all the reasons the LLM might want to use the skill, it would likely perform much better than just one sentence as well.
- Agents.md, skills, MCP, tools, etc. There's going to be a lot of areas explored that may yield no/negative benefit over the fundamentals of a clear prompt.
- Oh god, this scales badly and bloats your context window!
Just create an MCP server that does embedding retrieval or agentic retrieval with a sub agent on your framework docs.
Finally add an instruction to AGENT.md to look up stuff using that MCP.
- In my experience, agents only follow the first two or three lines of AGENTS.md + message. As the context grows, they start following random rules and ignoring others.
- That's why https://agenstskills.com validates every skill
- N00b Question - how do you measure performance for AI agents like the way they did in this article? Are there frameworks to support this type of work?
- It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.
- Title is: AGENTS.md outperforms skills in our agent evals
this is only gonna be an issue until the next-gen models, when the labs will aggressively post-train the models to proactively call skills
- Would someone know if their eval tests are open source and where I could find them? Seems useful for iterating on Claude Code behaviour.
- i dont know why, but this just feels like the most shallow “i compare llms based on the specs” kind of analysis you can get… it has extreme “we couldn’t get the llm to intuit what we wanted to do, so we assumed that it was a problem with the llm and we overengineered a way to make better prompts completely by accident” energy…
- Isn't that model-dependent? I skimmed through but did not find which model the tests were run with.
- My experience agrees with this.
Which is why I use a skill that is a command, that routes requests to agents and skills.
- This seems like an issue that will be fixed in newer model releases that are better trained to use skills.
- Everything outperforms skills if the system prompt doesn’t prioritize them. No news here.
static linking vs. dynamic, but we don't know the actual config and setup. and also the choice totally changes the problem
- But didn't you guys release skills.sh?
- inb4 people re-discover RAG, re-branding it as a parallel speculative data lookup
- > [Specifically Crafted Instructions For Working With A Specific Repository] outperforms [More General Instructions] in our agent evals
------> Captain Obvious Strikes Again! <----------
See the rest of the comments for examples: pedantic discussions about terms that are ultimately somewhat arbitrary and, if anything, suggest the singularity will be runaway technobabble, not technological progress.
- That feels like a stupid article. Well, of course, if you have one single thing you want to optimize, putting it into AGENTS.md is better. But the advantage of skills is exactly that you don't cram them all into the AGENTS file. Let's say you had 3 different elaborate things you want the agent to do. Good luck putting them all in your AGENTS.md and later hoping that the agent remembers any of it. After all, the key advantage of skills is that they get loaded at the end of the context when needed.
- question: anyone recognize that eval UI or is it something they made in-house?
- You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.
- Ah nice… vercel is vibecoded