by beshrkayali
12 subcomments
- > long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info)
I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.
I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then produces a build-plan TOML where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3].
1: https://github.com/ossature/ossature
2: https://github.com/beshrkayali/chomp8
3: https://github.com/ossature/ossature-examples
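The described flow could look something like the following sketch. This is a hypothetical illustration of a build-plan entry in that style, not Ossature's actual schema; all key names and values here are invented:

```toml
# Hypothetical build-plan task: it names only the spec sections and
# upstream files the LLM is allowed to see for this one task.
[[task]]
id = "cpu-opcodes"
spec_sections = ["instruction-set", "timing"]
upstream_files = ["src/memory.py"]
output = "src/cpu.py"
```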
- I still find it incredible how much power was unleashed by surrounding an LLM with a simple state machine and giving it access to bash
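The "state machine plus bash" idea can be sketched in a few lines. This is a minimal illustration, not any particular agent's implementation; `call_llm` is a stand-in for whatever chat-completion API you use, and the action format (`{"bash": ...}` / `{"done": ...}`) is invented:

```python
import subprocess

def run_agent(call_llm, task, max_turns=10):
    """Minimal agent loop. The 'state machine' is just three states:
    ask the model, run a shell command it requested, or stop.
    `call_llm` takes the message history and returns a dict like
    {"bash": "ls"} or {"done": "final answer"} (hypothetical protocol)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_llm(messages)          # state 1: model decides
        if "done" in action:                 # state 3: finished
            return action["done"]
        # state 2: execute the requested shell command, feed output back
        result = subprocess.run(action["bash"], shell=True,
                                capture_output=True, text=True, timeout=30)
        messages.append({"role": "tool",
                         "content": result.stdout + result.stderr})
    return None
```

Everything beyond this loop (context pruning, truncation, planning) is scaffolding layered on top.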
- The example is really lean and straightforward. I don't use coding agents, but this is a good overview and should help everyone understand that coding agents may produce sophisticated outcomes, but the raw interaction isn't magical at all.
It's also a good example of how any useful code component that needs 1k LOC can be turned into a mess of 500k LOC.
- Loved this writeup. I have built an agent for a specific niche use case for my clients (not a coding agent), but the principles are similar. I've only implemented 1-4 so far. Going to work on long-term memory next, but I worry about prompt injection issues when allowing the LLM to write its own notes.
Since my agent works over email, the core agent loop only processes one message, then hits the send_reply tool to craft a response. The next incoming email starts the loop again from scratch, injecting only the actual replies sent between user and agent. This naturally prunes the context, preventing the long-context-window problem.
I also had a challenge deciding what context to inject into the initial prompt vs. what to put into tools. It's a tradeoff between context bloat and the cost of tool lookups, which can get expensive paying per token. There's also caching to consider here.
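The reset-per-email loop described above could be sketched like this. It is an illustration under assumptions, not the commenter's actual code; `call_llm` and the message shapes are invented placeholders:

```python
def handle_email(call_llm, reply_log, incoming):
    """Each incoming email starts the loop from scratch: the only
    carried-over context is the log of replies actually exchanged,
    not the full reasoning/tool-call trace from previous runs."""
    # Rebuild context from the durable reply log only
    messages = [{"role": m["role"], "content": m["content"]} for m in reply_log]
    messages.append({"role": "user", "content": incoming})
    reply = call_llm(messages)   # agent crafts exactly one reply, then stops
    # Persist only the user-visible exchange for the next run
    reply_log.append({"role": "user", "content": incoming})
    reply_log.append({"role": "assistant", "content": reply})
    return reply
```

Because intermediate tool calls are never persisted, context growth is bounded by the actual correspondence rather than the agent's working history.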
Full writeup is here if anyone is interested: https://www.healthsharetech.com/blog/building-alice-an-empow...
- > This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.
People have been doing that for over a year already? GLM officially recommends plugging into Claude Code https://docs.z.ai/devpack/tool/claude and any model can be plugged into Codex CLI (it's open source and the model can be set via a config file).
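Per the z.ai docs linked above, the swap amounts to overriding the Anthropic-compatible endpoint before launching Claude Code. The variable names and URL follow those docs as I understand them (verify against the link); the token is a placeholder:

```shell
# Point Claude Code at GLM via z.ai's Anthropic-compatible API.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"   # placeholder, not a real key
# then run `claude` as usual
```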
by zbyforgotpass
4 subcomments
- Isn't there a better word than harness? I understand the metaphor of leading and constraining raw power, but I don't like it.
- Tool output truncation helps a lot and is one of the best ways to reduce context bloat. In my coding agent, the context is assembled from SQLite; I suffix the message ID so the truncated tool call can be rehydrated if needed, and it works great.
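The truncate-and-rehydrate pattern might look like this. A minimal sketch of the idea, not the commenter's implementation; the table schema, character limit, and marker format are all invented:

```python
import sqlite3

LIMIT = 200  # max characters of tool output kept inline (arbitrary choice)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tool_output (id INTEGER PRIMARY KEY, full TEXT)")

def truncate_for_context(output):
    """Store the full output in SQLite, keep only a prefix in the
    context, and suffix the row id so the agent can ask for the rest."""
    if len(output) <= LIMIT:
        return output
    cur = db.execute("INSERT INTO tool_output (full) VALUES (?)", (output,))
    db.commit()
    return output[:LIMIT] + f"... [truncated; rehydrate with id={cur.lastrowid}]"

def rehydrate(msg_id):
    """Fetch the full tool output back when the agent requests it."""
    row = db.execute("SELECT full FROM tool_output WHERE id = ?",
                     (msg_id,)).fetchone()
    return row[0] if row else None
```

The context stays small by default, and the full output is only paid for (in tokens) when the model explicitly asks for it.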
My exploration on context management is mostly documented here https://github.com/hsaliak/std_slop/blob/main/docs/CONTEXT_M...
- > This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.
Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common.
https://docs.z.ai/scenario-example/develop-tools/claude
It doesn't perform on par with Anthropic's models in my experience.
- Strong article! I’ve been using the engine/car analogy for a while now.
If you want to play with the basic building blocks of coding agents, check out https://github.com/OpenHands/software-agent-sdk
- Compounding is probably the breaking point: one agent's output is another agent's input, so does the garbage-in, garbage-out rule apply?
by crustycoder
0 subcomment
- A timely link - I've just spent the last week failing to get a ChatGPT Skill to produce a reproducible management reporting workflow. I've figured out why, and this article pretty much confirms my conclusions about the strengths & weaknesses of "pure" LLMs and how to work around them. This article is for a slightly different problem domain, but the general problems and the architecture needed to address them seem very similar.
by oortcrate_1
0 subcomment
- Totally agree. Chat history feels like a side effect, not a source of truth. Having an explicit markdown file for goals and constraints has been a game changer for my workflow. It turns out you don't need a complex setup; you just need the agent to be explicit about what it’s doing and why.
- The useful framing here is that coding agents improve less from raw model gains and more from better scaffolding around the model. Once you give them tools, repo context, and a simple state machine, the bottleneck shifts to context quality.
- I will also leave this here
https://github.com/shareAI-lab/learn-claude-code/tree/main/a...
I found it excellent at explaining a CC-like coding agent in layers.
- Awesome Read!!