Fwiw, I found it funny how the article stuffs "smarter context management" into a breeze-y TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types believing they can do a good job of this and then, when forced to have results/evals that don't suck in production, losing the rest of the year on that step. (And even worse when they decide to fine-tune too.)
Having said that, I think if you're going to write an article like this and call it "The Emperor Has No Clothes: How to Code Claude Code in 200 Lines of Code", you should at least include a reference to Thorsten Ball's excellent article from wayyy back in April 2025 entitled "How to Build an Agent, or: The Emperor Has No Clothes" (https://ampcode.com/how-to-build-an-agent)! That was (as far as I know) the first of these articles making the point that the core of a coding agent is actually quite simple (and all the deep complexity is in the LLM). Reading it was a light-bulb moment for me.
FWIW, I agree with other commenters here that you do need quite a bit of additional scaffolding (like TODOs and much more) to make modern agents work well. And Claude Code itself is a fairly complex piece of software with a lot of settings, hooks, plugins, UI features, etc. Although I would add that once you have a minimal coding agent loop in place, you can get it to bootstrap its own code and add those things! That is a fun and slightly weird thing to try.
(By the way, the "January 2025" date on this article is clearly a typo for 2026, as Claude Code didn't exist a year ago and it includes use of the claude-sonnet-4-20250514 model from May.)
Edit: and if you're interested in diving deeper into what Claude Code itself is doing under the hood, a good tool to understand it is "claude-trace" (https://github.com/badlogic/lemmy/tree/main/apps/claude-trac...). You can use it to see the whole dance with tool calls and the LLM: every call out to the LLM and the LLM's responses, the LLM's tool call invocations and the responses from the agent to the LLM when tools run, etc. When Claude Skills came out I used this to confirm my guess about how they worked (they're a tool call with all the short skill descriptions stuffed into the tool description base prompt). Reading the base prompt is also interesting. (Among other things, they explicitly tell it not to use emoji, which tracks as when I wrote my own agent it was indeed very emoji-prone.)
The agent "boots up" inside the REPL. Here's the beginning of the system prompt:
>>> help(assistant)
You are an interactive coding assistant operating within a Python REPL.
Your responses ARE Python code—no markdown blocks, no prose preamble.
The code you write is executed directly.
>>> how_this_works()
1. You write Python code as your response
2. The code executes in a persistent REPL environment
3. Output is shown back to you IN YOUR NEXT TURN
4. Call `respond(text)` ...
You get the idea. No need for custom file editing tools--Python has all that built in and Claude knows it perfectly. No JSON marshaling or schema overhead. Tools are just Python functions injected into the REPL, zero context bloat.

I also built a browser control plugin that puts Claude directly into the heart of a live browser session. It can inject element pickers so I can click around and show it what I'm talking about. It can render prototype code before committing to disk, killing the annoying build-fix loop. I can even SSH in from my phone and use TTS instead of typing, which is surprisingly great for frontend design work. I knocked out a website for my father-in-law's law firm (gresksingleton.com) in a few hours that would've taken 10x that a couple of years ago, and it was super fun.
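To make the REPL-as-interface idea above concrete, here's a rough sketch of the core loop. The `respond()` name comes from the prompt excerpt; everything else (turn cap, prompt wording, error handling) is made up and much simpler than the real thing:

import contextlib
import io
import anthropic

client = anthropic.Anthropic()

responses = []                      # text the agent decides to show the user
def respond(text: str):             # an injected "tool" is just a plain Python function
    responses.append(text)

namespace = {"respond": respond}    # persistent REPL state shared across turns
history = [{"role": "user", "content": "Rename foo() to bar() in utils.py"}]

for _ in range(20):                 # turn cap, just for the sketch
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system=("You are a coding assistant inside a Python REPL. "
                "Your entire reply is executed as Python. "
                "Call respond(text) to talk to the user."),
        messages=history,
    )
    code = reply.content[0].text
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):   # capture print() output for the next turn
            exec(code, namespace)               # the model's reply IS the program
        output = buf.getvalue() or "(no output)"
    except Exception as e:
        output = f"Exception: {e!r}"
    if responses:                               # the model called respond(), so we're done
        break
    history.append({"role": "assistant", "content": code})
    history.append({"role": "user", "content": f"REPL output:\n{output}"})

print(responses[-1] if responses else "(no respond() call within the turn cap)")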
The big win: complexity. CC has been a disaster on my bookkeeping system; there's a threshold past which Claude loses the forest for the trees and makes the same mistakes over and over. The code agent pushes that bar out significantly. Claude can build new tools on the fly when it needs them. Gemini works great too (larger context).
Have fun out there! /end-rant
I should start a blog with my experience from all of this.
At a high level it seems to usually be one (or a mix) of:
- full transcript appended every turn
- sliding window of the last N turns / tokens
- older turns summarized into a rolling memory
- structured state (goals, decisions, progress) rendered into the prompt
- external storage + retrieval (RAG-style) to pull in only relevant past info
Under the hood I’m sure it gets more complex, but the core idea is pretty simple once you strip away the mystique: memory = prompt assembly.
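For instance, here's a rough sketch of the "sliding window + rolling summary" mix; the function names and the summarization prompt are made up, and real systems vary a lot:

def assemble_prompt(state: dict, summary: str, recent_turns: list[dict], window: int = 10) -> list[dict]:
    # Structured state + the rolling summary go into the system message;
    # only the last `window` raw turns are re-sent verbatim.
    system = (
        "You are a coding agent.\n"
        f"Goals: {state.get('goals')}\n"
        f"Decisions so far: {state.get('decisions')}\n"
        f"Summary of earlier conversation: {summary or '(none)'}"
    )
    return [{"role": "system", "content": system}, *recent_turns[-window:]]

def fold_into_summary(llm, summary: str, old_turns: list[dict]) -> str:
    # Turns that fall out of the window get compressed into the summary instead of dropped.
    # `llm()` stands in for whatever completion call you use.
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in old_turns)
    return llm(f"Update this summary:\n{summary}\n\nwith these new turns:\n{transcript}")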
For example, the agent in the post will demonstrate 'early stopping' where it finishes before the task is really done. You'd think you can solve this with reasoning models, but it doesn't actually work on SOTA models.
To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
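The shape of the mechanism is roughly this (a sketch of the pattern, not Claude Code's actual code):

# Keep TODOs outside the model and re-render the open ones into every turn,
# so "what's still unfinished" survives long transcripts and distracted models.
todos = [
    {"task": "add failing test", "done": True},
    {"task": "fix the parser", "done": False},
    {"task": "update CHANGELOG", "done": False},
]

def todo_reminder() -> str:
    open_items = [t["task"] for t in todos if not t["done"]]
    if not open_items:
        return "All TODOs are complete. You may finish now."
    return ("Open TODOs (do not declare the task done while these remain):\n"
            + "\n".join(f"- {t}" for t in open_items))

# In the agent loop, append the reminder to the messages sent to the LLM each turn:
# messages.append({"role": "user", "content": todo_reminder()})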
Still, good article. Agents really are just tools in a loop. It's not rocket science.
This really blew my mind back then in the ancient times of 2024-ish. I remember the idea of agents just reached me and I started reading various "here I built an agent that does this" articles, and I was really frustrated at not understanding how the hell LLM "knows" how to call a tool, it's a program, but LLMs just produce text! Yes I see you are telling LLM about tools, but what's next? And then when I finally understood that there's no next, no need to do anything other than explaining — it felt pretty magical, not gonna lie.
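For anyone who hasn't hit that moment yet: the whole "protocol" can be a toy text format you parse yourself. This is a made-up format, not any vendor's actual tool-calling API, but it's the same idea:

import json, re

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}

SYSTEM = """You can use tools. To call one, reply with exactly:
TOOL {"name": "read_file", "args": {"path": "..."}}
Otherwise just answer in plain text."""

def handle(model_text: str) -> str:
    # The "protocol" is nothing more than parsing the text the model produced.
    m = re.match(r"TOOL\s+(\{.*\})", model_text.strip(), re.S)
    if not m:
        return model_text                    # plain answer, no tool call
    call = json.loads(m.group(1))
    result = TOOLS[call["name"]](**call["args"])
    return f"TOOL_RESULT:\n{result}"         # fed back to the model on the next turn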
I think it's a great way to dive into the agent world
https://github.com/samsaffron/term-llm
It's about my 10th attempt at the problem, so I'm aware of a lot of the edge cases. A very interesting bit of research here is:
https://gist.github.com/SamSaffron/5ff5f900645a11ef4ed6c87f2...
Fascinating read.
If your agent can execute Bash commands, it can do anything, including reading files (with cat), writing them (with sed/patch/awk/perl), grepping, finding, and everything else you may possibly need. The specialized tools are just an optimization to make things easier for the agent. They do increase performance (in the "how much can this do", not the "how fast is this" sense), but they're not strictly required.
IMHO, this is one of the more significant LLM-related discoveries of 2025. You don't need a context-polluting Github MCP that takes 10+% of your precious context window, all you need is the gh cli, which the agent already knows how to use.
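A sketch of what that minimal tool surface can look like (no sandboxing or approval step here, which you'd absolutely want before letting a model run this):

import subprocess

# One tool to rule them all: the model reads, writes, greps, and calls `gh`
# through ordinary shell commands. Everything else is "just" safety and UX.
def bash(command: str, timeout: int = 60) -> str:
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return f"exit={proc.returncode}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"

# e.g. the model might emit: bash("gh pr list --state open --limit 5")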
The hard part isn’t the loop - it’s the boring scaffolding that prevents early stopping, keeps state, handles errors, and makes edits/context reliable across messy real projects.
The core loop is straightforward: LLM + system prompt + tool calls. The differentiator is the harness: CLI, IDE extension, sandbox policies, filesystem ops (grep/sed/find). But what separates effective agents from the rest is context engineering. Anthropic and Manus have published various research articles around this topic.
After building vtcode, my takeaway: agent quality reduces to two factors, context management strategy and model capability. Architecture varies by harness, but these fundamentals remain constant.
[1] https://ampcode.com/how-to-build-an-agent [2] https://github.com/vinhnx/vtcode [3] https://www.anthropic.com/engineering/building-effective-age...
> I’m using OpenAI here, but this works with any LLM provider
Have you noticed there’s no OpenAI in the post?
We did just that back then and it worked great; we used it in many projects after that.
"What We Built vs. Production Tools This is about 200 lines. Production tools like Claude Code add:
Better error handling and fallback behaviors Streaming responses for better UX Smarter context management (summarizing long files, etc.) More tools (run commands, search codebase, etc.) Approval workflows for destructive operations
But the core loop? It’s exactly what we built here. The LLM decides what to do, your code executes it, results flow back. That’s the whole architecture."
But where's the actual test cases of the performance of his little bit of code vs. Claude Code? Is the core of Claude Code really just what he wrote (he boldly asserts 'exactly what we built here')? Where's the empirical proof?
I learned a similarly powerful way to build DIY coding CLIs from this Martin Fowler post, which uses PydanticAI and MCP-based tools: https://martinfowler.com/articles/build-own-coding-agent.htm...
Once you understand the underlying LLM tool-calling protocols described here—and how MCP tool calls work (they’re conceptually very similar)—most coding CLIs stop feeling like magic. Anthropic’s own deep dive on MCP was especially useful for me in seeing how to integrate this into a DIY “Claude Code”-style CLI, and even adapt the same approach for non-coding agents as well: https://www.deeplearning.ai/short-courses/mcp-build-rich-con...
Claude Code feels like the first commodity agent. In theory it's simple, but in practice you'll have to maintain a ton of random crap you get no value from maintaining.
My guess is eventually all "agents" will be wiped out by Claude Code or something equivalent.
Maybe the companies won't die, but all those startups will just be hooking up a generic agent wrapper and letting it do its thing directly. My bet is that the company that wins this is the one with the most training data to tune their agent to use their harness correctly.
For example
- how can I reliably have a decision block to end the loop (or keep it running)?
- how can I reliably call tools with the right schema?
- how can I reliably summarize context / excise noise from the conversation?
Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself and that leaves a cloud of uncertainty hanging over any agentic loops I build.
Perhaps I should accept that it's a feature and not a bug? :)
A lot of SaaS has turned into this too. Take a bloated monstrosity like Salesforce and I bet 95% of customers would be very happy with a "bare bones" version that costs one tenth the price.
The TODO injection nyellin mentions is a good example. It's not sophisticated ML - it's bookkeeping. But without it, the agent will confidently declare victory three steps into a ten-step task. Same with subagents - they're not magic, they're just a way to keep working memory from getting polluted when you need to go investigate something.
The 200-line version captures the loop. The production version captures the paperwork around the loop. That paperwork is boring but turns out to be load-bearing.
Oh, yes it's easy. That's just so cute.
from pathlib import Path
from typing import Any, Dict

def list_files_tool(path: str) -> Dict[str, Any]:
    full_path = Path(path).resolve()
    all_files = [str(p) for p in full_path.rglob("*") if p.is_file()]  # illustrative body
    return {
        "path": str(full_path),
        "files": all_files,
    }
Is that useful?

Not trivial. Just... smaller than expected. It really made me think how often we mistook surface-level complexity and product polish for system depth. Lately, I've been pondering: if I had to re-explain this from scratch, what is its irreducible core?
I'm curious to hear others: What system surprised you by being simpler than you expected? Where was the real complexity? What do people tend to overestimate?
But that's not correct. You give it write access to files that it then compiles and executes. Those files could include code that then runs with the rights of the executing user and manipulates the system. It already has one foot past the door, and you'd have to set up all kinds of safeguards to make sure it doesn't walk outside completely.
It's a fundamental problem if you give agentic AI rights on your system, which, on the other hand, is kind of the whole purpose of agentic AI.
(it's just an HTTP library wrapping Anthropic's REST API; reimplementing it - including auth - would add enough boilerplate to the examples to make this post less useful, but I just found it funny alongside the title choice)
Feels like we're headed toward a world where everyone can build these loops easily. Curious what you think separates good uses of these agents from mediocre ones.
This phrase feels like the new em dash...
Yeah, I agree there's a bunch of BS tools on top that basically try to coerce people into paying for and using their setup so they become dependent on that provider. They do provide some value, but they're so pushy that it's quite annoying.
To be clear I'm not implying any of that is useful but if you do want to go down that path then why not actually do it?
- https://github.com/rcarmo/bun-steward
- https://github.com/rcarmo/python-steward (created with the first one)
And they're self-replicating!
For example, post-training / finetuning the model specifically to use the tools it’ll be given in the harness. Or endlessly tweaking the system prompt to fine-tune the model’s behavior to a polish.
Plus - both OpenAI and Qwen have models specifically intended for coding.
proceeds to show a piece of code importing anthropic
was pretty confusing to me
This is a really nice open source coding agent implementation. The use of async is interesting.
Is this correct, and if so do we need to be concerned about user privacy and security?
from ddgs import DDGS

def web_search(query: str, max_results: int = 8) -> list[dict]:
    return DDGS().text(query, max_results=max_results)

Imagine an SDK that's dedicated to customizing tools like Claude Code/Cursor CLI to produce a class of software like B2B enterprise SaaS. Within the bounds of the domain(s) modeled, these vertical systems would ultimately even crush the capabilities of the thin low-level wrappers we have today.
Not 200 lines of Python.
- Intelligence (the LLM)
- Autonomy (loop)
- Tools to have "external" effects
Wrinkles that I haven't seen discussed much are:
(1) Tool-forgetting: the LLM forgets to call a tool (and instead outputs plain text). Some may say that these concerns will disappear as frontier models improve, but there will always be a need for having your agent scaffolding work well with weaker LLMs (cost, privacy, etc.), and as long as the model is stochastic there will always be a chance of tool-forgetting.
(2) Task-completion-signaling: Determining when a task is finished. This has 2 sub-cases: (2a) we want the LLM to decide that, e.g. search with different queries until desired info found, (2b) we want to specify deterministic task completion conditions, e.g., end the task immediately after structured info extraction, or after acting on such info, or after the LLM sees the result of that action etc.
After repeatedly running into these types of issues in production agent systems, we've added mechanisms for them in the Langroid[1] agent framework, which has a blackboard-like loop architecture that makes them easy to incorporate.
For issue (1) we can configure an agent with a `handle_llm_no_tool` [2] set to a "nudge" that is sent back to the LLM when a non-tool response is detected (it could also be set to a lambda function that takes other actions). As others have said, grammar-based constrained decoding is an alternative, but it only works for LLM APIs that support it.
For issue (2a) Langroid has a DSL[3] for specifying task termination conditions. It lets you specify patterns that trigger task termination, e.g.
- "T" to terminate immediately after a tool-call,
- "T[X]" to terminate after calling the specific tool X,
- "T,A" to terminate after a tool call, and agent handling (i.e. tool exec)
- "T,A,L" to terminate after tool call, agent handling, and LLM response to that
For (2b), in Langroid we rely on tool-calling again, i.e. the LLM must emit a specific DoneTool to signal completion. In general we find it useful to have orchestration tools for unambiguous control flow and message flow decisions by the LLM [4].
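Framework specifics aside, the generic shape of both mechanisms is small. This is only an illustration of the pattern (the `llm_step` reply format is invented), not Langroid's API:

DONE_TOOL = "done"   # the LLM must call this to signal completion (issue 2)
NUDGE = ("Your last reply was plain text, not a tool call. "
         "Please respond with a tool call, or call `done` if the task is finished.")

def run_task(llm_step, tools: dict, max_turns: int = 20):
    # `llm_step(messages)` is a stand-in that returns either
    # {"type": "text", "content": ...} or {"type": "tool", "name": ..., "args": {...}}.
    messages = []
    for _ in range(max_turns):
        reply = llm_step(messages)
        messages.append({"role": "assistant", "content": str(reply)})
        if reply["type"] == "text":             # issue (1): tool-forgetting -> send a nudge
            messages.append({"role": "user", "content": NUDGE})
            continue
        if reply["name"] == DONE_TOOL:          # issue (2): explicit completion signal
            return reply["args"].get("result")
        result = tools[reply["name"]](**reply["args"])
        messages.append({"role": "user", "content": f"{reply['name']} returned: {result}"})
    raise RuntimeError("max turns exceeded without a done() call")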
[1] Langroid https://github.com/langroid/langroid
[2] Handling non-tool LLM responses https://langroid.github.io/langroid/notes/handle-llm-no-tool...
[3] Task Termination in Langroid https://langroid.github.io/langroid/notes/task-termination/
[4] Orchestration Tools: https://langroid.github.io/langroid/reference/agent/tools/or...
Any and every "AI" experience is just kiddie level program mg wrapping LLMs.