This feels a bit like one of those “now you have two problems” solutions. After a few dozen sessions I would expect the tool registry to be full of “noise” for most prompts. I would also expect most tools to be extremely specific to the task at hand, leading to redundancy and ultimately poor programmability due to inconsistencies between tool APIs.
At $DAYJOB, we have an LLM-based tool, and this issue of "how do we avoid burning tokens solving the same problems over and over" was an early obstacle.
We wound up building something very similar to what you call "tools" (we named them "Saved Programs").
There's a wiki the LLM searches before solving a problem, with entries that link past actions to their saved programs.
If it finds one, it'll reuse it; otherwise it'll generate a program and offer to save it, if you think the task will come up often enough.
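Roughly, the loop looks like this; a minimal sketch with made-up names, not our actual API:

    // Sketch of the "search the wiki, reuse or generate" loop. All names here
    // are illustrative stand-ins, not our real API.
    interface SavedProgram {
      task: string;    // the past action this program was saved for
      source: string;  // the generated program itself
    }

    interface Wiki {
      search(task: string): Promise<SavedProgram | undefined>;
      save(program: SavedProgram): Promise<void>;
    }

    interface Agent {
      generateProgram(task: string): Promise<string>;
      run(source: string, task: string): Promise<string>;
    }

    async function solve(
      task: string,
      wiki: Wiki,
      agent: Agent,
      askToSave: (p: SavedProgram) => Promise<boolean>,
    ): Promise<string> {
      const existing = await wiki.search(task);
      if (existing) {
        // Reuse the saved program instead of burning tokens regenerating it.
        return agent.run(existing.source, task);
      }
      // Nothing saved yet: generate a fresh program for this task...
      const source = await agent.generateProgram(task);
      const result = await agent.run(source, task);
      // ...and only persist it if a human thinks the task will recur.
      if (await askToSave({ task, source })) {
        await wiki.save({ task, source });
      }
      return result;
    }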
I think right now this is still a bit too fresh out of Claude Code to be usable by anybody but the people developing it. I got to around the same point with my first attempt at building a tool registry (https://github.com/accretional/collector) and then realized I basically needed to start over with much more investment in supporting infrastructure to build the thing I really wanted.
I can go as far into the weeds as anybody would ever care to hear about this, but for the sake of brevity I’ll just say this: reflection and type systems over the network are pretty much the only way to get this stuff to work properly. (I mean, you could just go full MCP/Skills, but then all you really have are giant blobs of markdown and unconstrained JSON that make integration/discovery/usability a nightmare, and that require an agent in the loop to drive/integrate the tools when you really just need to give them the actual APIs and documentation.) That ends up getting rather hairy; we recently ended up building a declarative meta-lexer/parser/transpiler (meta basically just meaning it’s generalized across languages and self-hosting/bootstrapped) (https://github.com/accretional/gluon) because it turns out building a cross-language distributed type system is rather difficult. But reflection alone gets you halfway there as far as benefits go.
WHEN is upstream of WHAT and HOW. You can have perfect tool descriptions and perfect call signatures, but if the model can't read the situation to know whether the moment calls for any tool at all, you get either over-firing (agent burns tokens trying to "help") or under-firing (agent waits to be addressed and acts like a chatbot, not an autonomous participant).
I have had a lot of success when I refrain from codifying WHEN as rules. "If X then fire tool Y" is a dumb heuristic with extra steps. Describe the conditions of the moment. What's been tried, what's converged, what state the work is in. Then let the model decide whether to act and which tool fits.
Rules get stale. Situation-reads generalize.
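Concretely, the context I hand the model looks something like this (field names invented, just to show the shape of a situation read versus a trigger rule):

    // A "situation read": describe the state of the work and let the model
    // decide whether to act. No IF-X-THEN-Y trigger rules. Field names are
    // invented for illustration.
    interface Situation {
      attempted: string[];     // what's been tried so far
      converged: string[];     // what the work has settled on
      workState: string;       // where the work currently stands
      openQuestions: string[]; // what's still unresolved
    }

    function situationPrompt(s: Situation): string {
      return [
        `Tried so far: ${s.attempted.join("; ") || "nothing yet"}`,
        `Converged on: ${s.converged.join("; ") || "nothing yet"}`,
        `State of the work: ${s.workState}`,
        `Open questions: ${s.openQuestions.join("; ") || "none"}`,
        // The decision of whether any tool fits this moment stays with the model.
        `Decide whether this moment calls for action, and if so, which tool fits.`,
      ].join("\n");
    }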
Reading the Tendril README, it looks like the registration mechanic is solving a slightly different problem (the "too many tools" / context-bloat problem) by giving the agent three bootstrap tools and a growing registry. The WHEN itself still seems to be codified as rules in the system prompt ("BEFORE acting, call searchCapabilities; IF found, load and execute; IF NOT found, build yourself"). That's exactly the IF-X-THEN-Y pattern your framing seems to want to move past.
Curious whether you see the registry itself as the structured WHEN, or whether the rule-based system prompt is a starting point you intend to evolve toward something more situational.
Tendril is a reference implementation of what I'm calling the Agent Capability pattern. It starts with three bootstrap tools and builds everything else itself. The key constraint: there's no direct code execution. The agent can only run registered capabilities, so every task forces it to write a tool, define its invocation conditions, and register it for future sessions. The registry accumulates across sessions.
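In rough shape, a registry entry carries the tool's code plus its invocation conditions; something like this (a simplified sketch for illustration rather than the exact schema, and only searchCapabilities is a literal tool name):

    // Simplified sketch of the pattern. The agent never executes code
    // directly; it can only register capabilities and invoke them through the
    // registry, so every solved task leaves a reusable entry behind.
    interface Capability {
      name: string;
      description: string;                   // what the tool does
      invokeWhen: string;                    // conditions under which future sessions should reach for it
      inputSchema: Record<string, unknown>;  // description of the arguments it accepts
      source: string;                        // the code the agent wrote for it
    }

    // Roughly the three bootstrap tools; names other than searchCapabilities
    // are illustrative.
    interface CapabilityRegistry {
      searchCapabilities(query: string): Promise<Capability[]>;
      registerCapability(cap: Capability): Promise<void>;
      executeCapability(name: string, args: Record<string, unknown>): Promise<unknown>;
    }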
I also ran the self-extending loop against five local models — Qwen3-8B, Gemma 4, Mistral Small 3.1, Devstral Small 2, Salesforce xLAM-2. None passed.
The failure modes were distinct enough to be worth writing up separately: https://serverlessdna.com/strands/ai-agents/agents-know-what...
Stack: AWS Strands TypeScript SDK, Bedrock (Claude Sonnet), Deno sandbox, Tauri + React desktop shell.
It can update those notes automatically, but I’ve found that even with regular nudges, models are still somewhat reluctant to do it.
So manually running /learn every now and then, especially when I can tell it didn’t take the most direct path, helps.
Of course, being reliable and reliably extensible is the whole point, which means Claude Code made a better OC than OC did! I found this very amusing for some reason.
Also, you can put it (or your agent of choice; e.g. Codex works too) in a Telegram bot in like 50 lines of code, which is a lot of fun.
https://github.com/a-n-d-a-i/ULTRON/blob/main/src/index.ts
Though this might get you banned from Anthropic, they haven't quite clarified that yet. (Ostensibly it defaults to extra usage now, but who knows.)
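The wiring itself is roughly this shape (a minimal sketch rather than the repo's actual code; assumes the grammy library and an agent CLI with a non-interactive print flag like Claude Code's -p):

    // Minimal sketch only: relay each Telegram message to an agent CLI and
    // reply with whatever it prints. Assumes the grammy library and Claude
    // Code's non-interactive `-p` flag; swap in your agent of choice.
    import { Bot } from "grammy";
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);
    const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

    bot.on("message:text", async (ctx) => {
      // Hand the message text to the agent and wait for its printed answer.
      const { stdout } = await run("claude", ["-p", ctx.message.text], {
        maxBuffer: 10 * 1024 * 1024,
      });
      // Telegram messages max out at 4096 characters, so truncate.
      await ctx.reply(stdout.slice(0, 4000) || "(no output)");
    });

    bot.start();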
The main design decision we took was to integrate with your existing agent instead of building a new one. Your harness, swapped in, and you're off.
As an aside, building software for agents is incredibly fun.
It's evolved into a mesh-based operating system, gained its own GPU-based AI library/runtime, and even molted and extended itself to ESP nodes.
Getting closer to a full release sometime in May. For now, pieces are released on my github.
I think this is a simple and effective solution if you have a dozen or two tools. Maybe it won’t scale to hundreds or thousands, but that will be a problem for tomorrow’s me.
Which kind of solves the "when should we write a tool" part by just saying: always.
But I think the question is how this will scale. The real core issue I keep encountering is scaling complexity.
Reducing the number of tools without losing efficiency or capability.
Reducing duplication, abstracting, cleaning up, and maintaining knowledge and memory.
I think the issue for me has been threefold.
1. As the repo grows, how do you make the agent keep an understanding of it without excessive context pollution?
2. How do you maintain memory and knowledge over time?
3. How do you know the agent is performing better over time and not regressing as you evolve it?
And what has somewhat been working for me is:
A) trees or hierarchies.
Trees scale well: folder structure, but also simple indices.
Logical structure and locality make them even more effective.
B) caching.
Having the agents “cache” their thinking in the form of summaries, skills, and tools.
Recursive summarization really helped with monorepo navigation for me.
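Roughly what I mean by recursive summarization (a sketch; summarize() stands in for whatever model call you use):

    // Sketch of recursive summarization over a repo tree: each directory's
    // summary is built from its children's summaries and cached alongside it,
    // so later sessions read the cached index instead of re-walking the tree.
    import { readdir, readFile, writeFile } from "node:fs/promises";
    import { join } from "node:path";

    async function summarizeTree(
      dir: string,
      summarize: (text: string) => Promise<string>, // your model call
    ): Promise<string> {
      const entries = await readdir(dir, { withFileTypes: true });
      const children: string[] = [];
      for (const entry of entries) {
        if (entry.name.startsWith(".") || entry.name === "node_modules") continue;
        const path = join(dir, entry.name);
        if (entry.isDirectory()) {
          // A directory's summary is a summary of its children's summaries.
          children.push(`${entry.name}/: ${await summarizeTree(path, summarize)}`);
        } else if (entry.isFile()) {
          children.push(`${entry.name}: ${await summarize(await readFile(path, "utf8"))}`);
        }
      }
      const summary = await summarize(children.join("\n"));
      // Cache the "thinking" so the agent doesn't redo it next session.
      await writeFile(join(dir, ".summary.md"), summary);
      return summary;
    }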
But right now I still feel like I need to be constantly prompting them and I can’t quite close the feedback loop.