This has been solved already: automated testing. Tests encode the behaviour of the system into executables that actually tell you whether your system aligns or not.
Better to encode the behaviour of your system into real, executable, scalable specs (a.k.a. automated tests); otherwise your app's behaviour is going to spiral out of control after the Nth AI-generated feature.
The way to ensure this actually scales with the firepower LLMs have for writing implementations is to enforce a workflow where the agent knows how to test, writes the tests first, and verifies via mutation testing that the tests actually reflect the behaviour of the system.
I've scoped this out here [1] and here [2].
[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter
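To make the mutation-testing step concrete, here's a toy sketch in Python: flip one operator in the code under test and confirm the suite notices. In practice you'd reach for a real mutation-testing tool; every name below is illustrative.

```python
import ast

SOURCE = """
def apply_discount(price, rate):
    return price * (1 - rate)
"""

def tests_pass(module):
    # The "spec": an executable assertion about behaviour.
    return module["apply_discount"](100, 0.1) == 90.0

class FlipSub(ast.NodeTransformer):
    # Mutate `-` into `+`, a classic arithmetic-operator mutant.
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Sub):
            node.op = ast.Add()
        return node

def run(source):
    module = {}
    exec(compile(ast.parse(source), "<src>", "exec"), module)
    return tests_pass(module)

mutant = ast.fix_missing_locations(FlipSub().visit(ast.parse(SOURCE)))
print("original passes:", run(SOURCE))                 # True
print("mutant killed:", not run(ast.unparse(mutant)))  # True: tests caught it
```

If a mutant survives (the tests still pass against mutated code), the tests don't actually pin down the behaviour they claim to.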
If I fork out a version for others that is public, then I have to maintain that variation as well.
Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compared to my system, but at the same time I don't know how to convey how to use my system, as someone who just uses it alone.
It feels like I don't want anyone to run my system; I just want people to point their AI system at mine and ask it what might be valuable to add to their own.
I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.
It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.
The best way I have today is to start with a project requirements document and then ask for a step-by-step implementation plan, and then go do the thing at each step but only after I greenlight the strategy of the current step. I also specify minimal, modular, and functional stateless code.
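For illustration, that greenlight loop as a script (assuming the Anthropic Python SDK; the model id, file names, and step-splitting are placeholders, and in practice this all happens interactively in the chat UI):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(history, prompt):
    history.append({"role": "user", "content": prompt})
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=4096,
        messages=history,
    )
    text = reply.content[0].text
    history.append({"role": "assistant", "content": text})
    return text

history = []
requirements = open("REQUIREMENTS.md").read()
plan = ask(history, "Produce a step-by-step implementation plan for:\n" + requirements)

for step in plan.split("\n\n"):  # naive step splitting, illustrative only
    strategy = ask(history, "Describe your strategy for this step:\n" + step)
    if input(strategy + "\n\nGreenlight this step? [y/N] ").lower() != "y":
        continue  # withhold the greenlight: skip or re-plan instead of executing
    ask(history, "Approved. Implement it as minimal, modular, stateless code.")
```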
For this reason I don’t think it’s actually a good name. It should be called planning-shit instead. Since that’s seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn’t need this at all, and the plans were just alright.
I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.
I think the point of any spec-driven framework is that you want to eventually own the workflow yourself, so that you can constrain code generation on your own terms.
https://zarar.dev/spec-driven-development-from-vibe-coding-t...
I want a system that enforces planning, tests, and adversarial review (preferably by a different company's model). This is more for features, less for overall planning, but a similar workflow could be built for planning.
1. Prompt
2. Research
3. Plan (including the tests that will be written to verify the feature)
4. Adversarial review of the plan
5. Implementation of the tests; CI must fail on them
6. Adversarial review verifying that the tests match the plan
7. Implementation to make the tests pass
8. Adversarial PR review of the implementation
I want to be able to check on the status of PRs based on how far along they are, read the plans, suggest changes, read the tests, suggest changes. I want a web UI for that; I don't want to be doing all of this in multiple terminal windows.
A key feature I want is that if a step fails, especially because of adversarial review, the whole PR branch is force-pushed back to the previous state. So say #6 fails: #5 is re-invoked with the review information. Or if I come to the system and a PR is at #8 and I don't like the plan, I make some edits to the plan (#3), the PR is reset to the git commit after the original plan, and the LLM is re-invoked with either my new plan or, more likely, my edits to it; then everything flows through again.
I want to be able to sit down, tend to a bunch of issues, then come back in a couple of hours and see progress.
I have a design for this of course. I haven't implemented it yet.
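For a rough idea, here's a minimal sketch of that reset-on-failure loop (stage names mirror the numbered list above; the agents and git plumbing are stubbed, and every identifier is hypothetical):

```python
import subprocess

STAGES = ["prompt", "research", "plan", "plan_review",
          "write_tests", "test_review", "implement", "pr_review"]

class FeaturePipeline:
    def __init__(self, branch):
        self.branch = branch
        self.checkpoints = {}  # stage name -> git commit sha

    def checkpoint(self, stage):
        sha = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True).stdout.strip()
        self.checkpoints[stage] = sha

    def reset_to(self, stage):
        # Rewind the PR branch to the commit recorded after `stage`,
        # discarding everything the later stages produced.
        subprocess.run(["git", "reset", "--hard", self.checkpoints[stage]])
        subprocess.run(["git", "push", "--force", "origin", self.branch])

    def run(self, agents):
        # Each agent takes reviewer feedback (or None) and returns
        # (ok, feedback); the adversarial stages are where ok can be False.
        i, feedback = 0, None
        while i < len(STAGES):
            ok, feedback = agents[STAGES[i]](feedback)
            if ok:
                self.checkpoint(STAGES[i])
                i, feedback = i + 1, None
            else:
                i = max(i - 1, 0)        # e.g. #6 fails -> rewind to #5
                self.reset_to(STAGES[i])
```

Editing the plan on an in-flight PR is the same move: reset_to("plan"), then rerun from there with the edits as feedback.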
I think the secret sauce is to talk to the model about what you want first and make the plan; then, when you feel good about the spec, regardless of tooling (you can even just use a simple markdown file!), you have it work on it. Since it always has a file to go back to, it can never 'forget'; it just needs to remember to review the file. The more detail in the file, the more powerful the output.
Tell your coding model: how you want it, what you want, and why you want it. It also helps to ask it to poke holes and raise concerns (bypass its overly agreeable nature so you don't waste time on things that are too complex).
I love using Claude to prototype ideas that have been in my brain for years, and they wind up coming out better than I ever envisioned.
This is the real challenge. The people I know who jump around to new tools have a tough time explaining what they want, and thus how the new tool is better than the last one.
Sometimes annoying - you can't really fire and forget (I tend to regret skipping the discussion on any complex task), and it asks a lot of questions. But I think that's partly why the results are pretty good.
The new /gsd:list-phase-assumptions command added recently has been a big help there to avoid needing a Q&A discussion on every phase - you can review and clear up any misapprehensions in one go and then tell it to plan -> execute without intervention.
It burns quite a lot of tokens reading and re-reading its own planning files at various times, but it manages context effectively.
Been using the Claude version mostly. Tried it in OpenCode too, but it's a bit buggy there.
They are working on a standalone version built on pi.dev https://github.com/gsd-build/gsd-2 ...the rationale is good I guess, but it's unfortunate that you can't then use your Claude Max credits with it, as it has to use the API.
But what makes a difference is running plan-review and work-review agents: they fix issues before and after the work. Both pull their weight, but the plan-review one is the most surprising. The work-review judge reliably finds bugs to fix, but its insights are less surprising. Both should run as separate subagents, not from the main one, because they need a fresh perspective.
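A sketch of the fresh-perspective point: the plan reviewer gets its own conversation containing only the plan, never the main agent's history, so it can't inherit the planner's assumptions (assuming the Anthropic Python SDK; the model id and prompt are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

def review_plan(plan_markdown):
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=2048,
        system="You are an adversarial plan reviewer. Find missing steps, "
               "wrong assumptions, and untestable requirements.",
        messages=[{"role": "user", "content": plan_markdown}],  # the plan only
    )
    return reply.content[0].text
```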
Other things that matter are 1. testing enforcement, 2. cross-task project memory. My implementation for memory is a combination of capturing user messages with a hook, an append-only log, and keeping a compressed memory state of the project, which gets read before work and updated after each task.
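A minimal sketch of that memory scheme, assuming a hook that receives the user message as JSON on stdin (roughly how Claude Code's UserPromptSubmit hook behaves); the file layout and the summarise step are hypothetical:

```python
#!/usr/bin/env python3
import json, pathlib, sys, time

LOG = pathlib.Path(".memory/log.jsonl")   # append-only, never rewritten
STATE = pathlib.Path(".memory/state.md")  # compressed project state

def main():
    event = json.load(sys.stdin)          # hook payload with a "prompt" field
    LOG.parent.mkdir(exist_ok=True)
    with LOG.open("a") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": event["prompt"]}) + "\n")
    # After each task, a separate step would re-summarise the log into STATE,
    # which the agent reads before starting work. Stubbed here:
    # STATE.write_text(summarise(LOG.read_text()))

if __name__ == "__main__":
    main()
```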
I would imagine that for a non-engineer trying to code it would be quite useful / deliver a better result / be less liable to end up in a total mess. But for experienced engineers it quickly felt like overkill / Claude itself just gets better and better. Particularly once we got agent swarms, I left GSD and don't think I'll be back. But I would recommend it to non-coders trying to code.
If it were a game engine or a new web framework, for example, there would be demos or example projects linked somewhere.
If multiple people work with different AI tools on the same project, they will all add their own stuff in the project and it will become messy real quick.
I'll keep superpowers, claude-mem, context7 for the moment. This combination produces good results for me.
Is this supposed to run in a VM?
Claude Code itself consumes a lot of tokens when not needed. I have to steer it a lot while building large applications.
I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this, but am horrified that the first-ever attempt would be a production system critical to the annual strategic plan. :-/
One pattern that's worked well for me: instead of writing specs manually, I extract structured architecture docs from existing systems (database schemas, API endpoints, workflow logic) and use those as the spec. The AI gets concrete field names, actual data relationships, and real business logic — not abstractions. The output quality jumps significantly compared to hand-written descriptions.
The tricky part is getting that structured context in the first place. For greenfield projects it's straightforward. For migrations or rewrites of existing systems, it's the bottleneck that determines whether AI-assisted development actually saves time or just shifts the effort from coding to prompt engineering.
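As a toy example of pulling the spec out of the system rather than writing it, here's a sketch that dumps a SQLite schema into a markdown section an agent can be pointed at (file names are placeholders; a real version would add API routes, relationships, and workflow logic):

```python
import sqlite3

def schema_as_spec(db_path):
    con = sqlite3.connect(db_path)
    lines = ["## Database schema (extracted, do not hand-edit)"]
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (table,) in tables:
        lines.append("\n### " + table)
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        for _cid, name, ctype, _notnull, _default, pk in con.execute(
                f"PRAGMA table_info({table})"):
            lines.append(f"- {name}: {ctype}" + (" PRIMARY KEY" if pk else ""))
    return "\n".join(lines)

if __name__ == "__main__":
    open("SPEC.generated.md", "w").write(schema_as_spec("app.db"))
```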
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
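A fair chunk of that review is mechanical enough to script. A sketch of the kind of thing I mean (the regexes and auth-decorator names are assumptions; a real check would be framework-aware):

```python
import pathlib, re

SECRET_RE = re.compile(
    r"(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}", re.I)
ROUTE_RE = re.compile(r"@app\.(get|post|put|delete|route)\(")  # Flask/FastAPI-ish
AUTH_HINTS = ("login_required", "Depends(", "require_auth")    # hypothetical names

for path in pathlib.Path(".").rglob("*.py"):
    text = path.read_text(errors="ignore")
    for m in SECRET_RE.finditer(text):
        print(f"{path}: possible hardcoded credential: {m.group(0)[:40]}")
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if ROUTE_RE.search(line):
            window = "\n".join(lines[max(0, i - 3):i + 3])
            if not any(h in window for h in AUTH_HINTS):
                print(f"{path}:{i + 1}: route with no nearby auth check")
```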
Honestly a fantastic harness right out of the box. Give it a good spec and it can easily walk you through fairly complex apps.
It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
Makes sense for consistency, but also shifts the problem:
how do you keep those artifacts in sync with the actual codebase over time?
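One low-tech answer, sketched: make CI fail when the code a doc describes has changed more recently than the doc itself. The doc-to-source mapping here is hypothetical; the git invocation is standard:

```python
import subprocess, sys

DOC_COVERS = {                       # doc -> the paths it describes
    "docs/schema.md": ["migrations/"],
    "docs/api.md": ["src/routes/"],
}

def last_commit_ts(path):
    out = subprocess.run(["git", "log", "-1", "--format=%ct", "--", path],
                         capture_output=True, text=True).stdout.strip()
    return int(out or 0)

stale = [doc for doc, sources in DOC_COVERS.items()
         if any(last_commit_ts(src) > last_commit_ts(doc) for src in sources)]
if stale:
    sys.exit("Docs out of date with code: " + ", ".join(stale))
```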
Like most spec-driven development tools, GSD works well for greenfield projects or the first few rounds of "compound engineering." However, like all the others, the project eventually gets too big and GSD can't manage to deliver working code reliably.
Agents working GSD plans will start leaving orphans all over; it won't wire them up properly because the verification stages use simple lexical tools to search the code for implementation facts. I tried giving GSD some AST-aware tools, but good luck getting Claude to use them reliably.
Ultimately I put GSD back on the shelf and developed my own "property graph" based planner, which is closer to Claude's plan mode except that the design source of truth is structured properties, not markdown. The system generates user docs from the graph. Agents are only tasked directly, as the graph closes nodes and re-sorts around invariants.
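Roughly, the shape is: nodes carry structured properties, and agents are dispatched only once a node's dependencies have closed. A simplified sketch (names illustrative, not my production code):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    properties: dict                     # structured facts, not markdown prose
    depends_on: list = field(default_factory=list)
    closed: bool = False

def ready_nodes(graph):
    # A node may be handed to an agent once every dependency is closed;
    # as nodes close, the ready set re-sorts itself around the graph.
    return [n for n in graph.values()
            if not n.closed and all(graph[d].closed for d in n.depends_on)]
```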
Looked at profile, hasn't done or published anything interesting other than promoting products to "get stuff done"
This is like the TODO list book gurus writing about productivity
There is a gsd-plan-checker that runs before execution, but it only verifies logical completeness — requirement coverage, dependency graphs, context budget. It never looks at what commands will actually run. So if the planner generates something destructive, the plan-checker won't catch it because that's not what it checks for. The gsd-verifier runs after execution, checking whether the goal was achieved, not whether anything bad happened along the way. In /gsd:autonomous this chains across all remaining phases unattended.
The granular permissions fallback in the README only covers safe reads and git ops — but the executor needs way more than that to actually function. Feels like there should be a permission profile scoped to what GSD actually needs without going full skip.
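A sketch of the missing pre-execution check: scan the commands a plan intends to run for destructive patterns before anything executes. The plan format (one shell command per line) and the pattern list are assumptions, not GSD's actual internals:

```python
import re, sys

DESTRUCTIVE = [
    r"\brm\s+-rf?\b",
    r"\bgit\s+push\s+.*--force\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bdrop\s+(table|database)\b",
    r"\bchmod\s+-R\b",
]

def check_plan_commands(commands):
    ok = True
    for cmd in commands:
        for pat in DESTRUCTIVE:
            if re.search(pat, cmd, re.I):
                print(f"BLOCKED ({pat}): {cmd}")
                ok = False
    return ok

if __name__ == "__main__":
    planned = sys.stdin.read().splitlines()  # one planned shell command per line
    sys.exit(0 if check_plan_commands(planned) else 1)
```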
That's not a reason to stop trying. This is the iterative process of figuring out what works.
"I want to use 'get shit done' as part of my project"
These days it's not a big deal at all at most places, but there are places where it will raise an eyebrow. I'm not saying change its name, and you've probably considered this already, but may I suggest making the official meaning of GSD tongue-in-cheek, perhaps? Whatever; it's a kick-ass project either way.
It's already quite debatable whether software developers should be called software engineers, but this is just ridiculous.
I've been using a Claude Pro plan just as a code analyzer / autocomplete for a year or so. But I recently decided to try to rewrite a very large older code base I own, and set up an AI management system for it.
I started this last week, after reading about paperclip.ing. But my strategy was to layer the system in a way I felt comfortable with, so I set up something that now feels a bit like a Rube Goldberg machine. What I did was set up a clean box and give my Claude Pro plan root access to it. Then I set up openclaw on that box, but not with root... so just in case it ran wild, I could intervene. Then I had openclaw set up paperclip.ing.
The openclaw instance is on a separate Claude API account and is already costing what seems like way too many tokens, but it now has a lot of memory of the project, and in fairness, for the $150 I've spent, it has rewritten an enormous chunk of the code in a satisfactory way (with a lot of oversight). I do like being able to WhatsApp with it; that's a huge bonus.
But I feel like maybe this a pretty wasteful way of doing things. I've heard maybe I could just run openclaw through my Claude Pro plan, without paying for API usage. But I've heard that Anthropic might be shutting down that OAuth pathway. I've also heard people saying openclaw just thoroughly sucks, although I've been pretty impressed with its results.
The general strategy I'm taking on this is to have Claude read the old codebase side by side with me in VSCode, then prepare documents for openclaw to act on as editor, then re-evaluate; then have openclaw produce documents for agent roles in Paperclip and evaluate them.
Am I just wasting my money on all these API calls? $150 so far doesn't seem bad for the amount of refactoring I've gotten, across a database and back and front end at the same time, which I'm pretty sure Claude Pro would not have been able to handle without much more file-by-file supervision. I'm slightly afraid now to abandon the memory I've built up with openclaw and switch to a different tool. But hey, maybe I should just be doing this all on the Claude Pro CLI at this point...?
Looking for some advice before I try to switch this project to a different paradigm. But I'm still testing this as a structure, and trying to figure out the costs.
[Edit: I see so many people talking about these lighter-weight frameworks meant for driving an agent through a large, long-running code building task... like superpowers, GSD, etc... which to me as a solo coder sound very appealing if I were building a new project. But for taking 500k LOC and a complicated database and refactoring the whole thing into a headless version that can be run by agents, which is what I'm doing now, I'm not sure those are the right tools; but at the same time, I never heard anyone say openclaw was a great coding assistant -- all I hear about it being used for is, like, spamming Twitter or reading your email or ordering lunch for you. But I've only used it as a code-manager, not for any daily tasks, and I'm pretty impressed with its usefulness at that...]
If I remember correctly, it created a lot of changes and spent a lot of time doing something, and in the end it was all smoke and mirrors. If I were ever to use something like this, I would maybe use BMad, which suffers from the same issues, like Speckit and others.
I don't know if they have some sponsorship deal with a bunch of YouTubers who are raving about how awesome this is... without any supporting evidence.
Anyhow, this is my experience. Superpowers, on the other hand, have been quite useful so far, but I haven't used them enough to claim anything.