I'd run the following 5-10 times with one model, then again with a 2nd model.
"Verify the correctness and completeness of all security configs/rules in SETUP.md. Consider if anything is missing, and if anything is not needed. Do not modify any files; only write potential findings to report.txt"
"Verify all findings and claims in report.txt."
Replace "SETUP.md" with whatever you're working on.
It's both terrifying and incredible watching what the models get correct and what they get completely wrong.
However, after enough runs they tend to settle on a state they claim needs no more edits, and that result is generally useful, with far fewer errors/hallucinations than a single run.
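If you'd rather script the loop than paste prompts by hand, here is a rough sketch. It assumes you supply two hypothetical callables of your own: `run_prompt(model, prompt)`, which sends a prompt to whatever tool you use and returns its reply, and `is_settled(reply)`, which decides whether the model is claiming no further edits are needed. Neither is a real API, and the function name and `max_runs` cap are illustrative choices.

```python
from typing import Callable, Sequence

AUDIT_PROMPT = (
    "Verify the correctness and completeness of all security configs/rules in "
    "{target}. Consider if anything is missing, and if anything is not needed. "
    "Do not modify any files; only write potential findings to report.txt"
)
VERIFY_PROMPT = "Verify all findings and claims in report.txt."


def review_rounds(
    run_prompt: Callable[[str, str], str],   # hypothetical: (model, prompt) -> reply
    is_settled: Callable[[str], bool],       # hypothetical: does the reply claim "no more edits"?
    models: Sequence[str],
    target: str = "SETUP.md",
    max_runs: int = 10,
) -> list[tuple[str, int, str]]:
    """Run the audit/verify prompt pair up to max_runs times per model.

    Stops early for a model once its verify reply is judged settled.
    Returns (model, run_number, verify_reply) for every run performed.
    """
    history = []
    for model in models:
        for run in range(1, max_runs + 1):
            run_prompt(model, AUDIT_PROMPT.format(target=target))
            reply = run_prompt(model, VERIFY_PROMPT)
            history.append((model, run, reply))
            if is_settled(reply):
                break
    return history
```

The second model only starts after the first has settled or hit the cap, which matches "then again with a 2nd model". You still need to read report.txt yourself at the end.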
FWIW, my quick impression is that it takes reasonable concepts and tries to formalize them into a framework. I can see potential benefits: I've certainly asked a Claude Code session to have a look at pipeline so-and-so and figure out the issue. But I'm not really convinced by this at first glance either. Both setup cost and token cost seem like downsides.
A good harness constrains the action surface, context, and task boundaries. An agent’s failure isn’t always due to “writing incorrect code” — it can also result from “doing things it wasn’t supposed to do.” Tests and lints can verify part of the correctness, but they often fail to validate task scope. A well-designed harness should shift the review process from “reading the entire diff” to “verifying whether the changes stay within the defined task boundaries.”
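As a rough illustration of the "stay within task boundaries" check, here is a minimal sketch that flags changed files outside an allow-list of path patterns. The function name, the example paths, and the idea of expressing the boundary as glob patterns are my assumptions, not something a particular harness prescribes. The changed paths could come from something like `git diff --name-only`.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Sequence


def out_of_scope(changed_paths: Iterable[str], allowed_globs: Sequence[str]) -> list[str]:
    """Return the changed paths that match none of the allowed patterns.

    Note: with fnmatch-style patterns, "*" also matches "/" characters,
    so "src/auth/*" covers nested directories too.
    """
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_globs)
    ]


# Hypothetical task: "fix the login bug", scoped to src/auth/.
changed = ["src/auth/login.py", "src/auth/tests/test_login.py", "deploy/prod.yaml"]
allowed = ["src/auth/*"]

violations = out_of_scope(changed, allowed)
print(violations)  # ['deploy/prod.yaml']
```

A non-empty result doesn't prove the change is wrong, only that the reviewer should look there first instead of reading the entire diff.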
I had a quick look at the content, and it seems verbose and AI-generated but conceptually OK. I learn by tinkering, so it's not a good fit for me, but if you learn by reading, maybe this is for you.
My view is that human time is more precious than computer time. If something can be automated, then automate it. I don't lint code by hand, I get the linter to do it. Similarly, LLMs expand the list of things that computers can do. That's what you get from the harness, however you learn to do it.
Why not just look through the actual Claude Code codebase and use your own AI to deconstruct it?