1) I didn't like that Beads was married to git via git hooks, and
2) this exact problem: Claude would just close tasks without any validation steps.
So I made my own that uses SQLite, and introduced what I call gates. Every task must have a gate; gates can be reused, but each task <-> gate relationship is unique, so a gate that passed on a previous task isn't considered passed when you reuse it for a new one.
I haven't seen it bypass the gates yet; it usually tells me it can't close a ticket.
A gate in my design can be anything: as simple as having the agent build the project, run the unit tests, or even ask a human to test.
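For illustration only (the table and column names here are hypothetical, not the actual schema), the unique task <-> gate pairing can be enforced with a composite primary key in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE gates (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
-- Each (task, gate) pair is unique: reusing a gate on a new task
-- starts a fresh row with passed = 0, so an old pass never carries over.
CREATE TABLE task_gates (
    task_id INTEGER NOT NULL REFERENCES tasks(id),
    gate_id INTEGER NOT NULL REFERENCES gates(id),
    passed  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (task_id, gate_id)
);
""")

conn.execute("INSERT INTO tasks VALUES (1, 'old task'), (2, 'new task')")
conn.execute("INSERT INTO gates VALUES (10, 'unit tests pass')")
conn.execute("INSERT INTO task_gates VALUES (1, 10, 1)")  # old task passed this gate
conn.execute("INSERT INTO task_gates (task_id, gate_id) VALUES (2, 10)")  # gate reused

def can_close(task_id):
    """A task may only be closed when none of its gates are still unpassed."""
    row = conn.execute(
        "SELECT COUNT(*) FROM task_gates WHERE task_id = ? AND passed = 0",
        (task_id,),
    ).fetchone()
    return row[0] == 0

print(can_close(1))  # True: its gate passed
print(can_close(2))  # False: same gate, new task, not passed yet
```

Because `passed` defaults to 0 on every new (task, gate) row, reusing a gate never inherits an earlier pass.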
Seems to me like everyone's building tooling to make coding agents more effective and efficient.
I do wonder if we need a complete, generic spec for coding agents, maybe one that includes this too. To my knowledge, Anthropic is the only vendor that publicly publishes specs for coding agents.
I would also recommend creating standards for the new protocols you are developing. Protocols need standards so that others can write their own implementations. With a standard, someone could build in a completely different language (like Rust or Go), use none of the SDKs you provide, and still be interoperable with your AAP and AIP implementations for smoltbot (because both sides support the standards behind the AAP and AIP protocols).
I also want to note that you cannot trust the LLM to do what your instructions say. The moment it falls victim to a prompt injection or confused-deputy attack, all bets are off. Instructions are soft: more like advice or guidance than a control or gate. To provide true controls and gates, they must be external, authoritative, and enforced below the decision layer.
Cool stuff Alex - looking forward to seeing where you go with it!!! :)
Anecdotally, I often end up babysitting agents running against codebases with non-standard choices (e.g. yarn over npm, podman over docker) and generally feel that I need a better framework to manage these. This looks promising as a less complex solution - can you see any path to making it work with coding agents/subscription agents?
I've saved this to look at in more detail later for a current project. When exposing an embedded agent to internal teams, I'm very wary of handling the client conversations around alignment, so I find the presentation of the cards and the violations very interesting: I think they'll understand the risks a lot better, and it may also give them a method of 'tuning'.
I have been following AlignTrue https://aligntrue.ai/docs/about but I think I prefer your way of doing accountability: acting on the thinking process instead of being passive. It's also a down-to-earth, more practical approach.
Great live demo, but I would have liked a more in-depth showcase of AAP and AIP, even in this multi-agent setting, to understand the full picture better. Or perhaps prepare a separate showcase just for AAP and AIP. Just my two cents.
PS. I'm the creator of LynxPrompt, which honestly falls very short for the cases we're discussing here, but I mention it to say I stay engaged with the trust/accountability topic: how to organize agents and guide them properly without supervision.
That seems like a pretty critical flaw in this approach, does it not?
The only way we will actually secure agents is by giving them only the permissions they need for their tasks. A system that uses your contract proposal to create an AuthZ policy, tied to a short-lived bearer token the agent presents on its tool calls, would ensure the agent actually behaves how it ought to.
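A minimal sketch of that idea, using an HMAC-signed token for brevity (the contract shape, tool names, and function names are all hypothetical, not an existing API):

```python
import hmac, hashlib, json, time, base64

SECRET = b"demo-secret"  # in practice, a per-deployment signing key

def issue_token(contract, ttl=300):
    """Mint a short-lived bearer token scoped to the contract's permissions."""
    claims = {"tools": contract["allowed_tools"], "exp": time.time() + ttl}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def authorize(token, tool_name):
    """Enforce the policy below the agent's decision layer: signature, expiry, scope."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # tampered token
    claims = json.loads(base64.urlsafe_b64decode(body))
    return time.time() < claims["exp"] and tool_name in claims["tools"]

contract = {"allowed_tools": ["read_file", "run_tests"]}
token = issue_token(contract)
print(authorize(token, "run_tests"))    # True: granted by the contract
print(authorize(token, "delete_repo"))  # False: never granted
```

The key property is that the check lives in the tool gateway, not in the prompt: even a fully injected agent can only call what the token encodes.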
Checkpoints produce signed certificates: SHA-256 input commitments + Ed25519 signatures + a tamper-evident hash chain and a Merkle inclusion proof. Mess with any of it and the math breaks.
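A sketch of just the hash-chain portion, using stdlib SHA-256 only (Ed25519 signing and Merkle proofs omitted; the record shapes and function names are illustrative, not the actual implementation):

```python
import hashlib, json

def commit(record, prev_hash):
    """SHA-256 commitment over the record plus the previous link's hash."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def build_chain(records):
    chain, prev = [], "0" * 64  # genesis hash
    for rec in records:
        h = commit(rec, prev)
        chain.append({"record": rec, "prev": prev, "hash": h})
        prev = h
    return chain

def verify(chain):
    prev = "0" * 64
    for link in chain:
        if link["prev"] != prev or commit(link["record"], prev) != link["hash"]:
            return False  # the math breaks here
        prev = link["hash"]
    return True

chain = build_chain([{"checkpoint": 1, "result": "tests passed"},
                     {"checkpoint": 2, "result": "human approved"}])
print(verify(chain))  # True

chain[0]["record"]["result"] = "tests skipped"  # tamper with history
print(verify(chain))  # False: the commitment no longer matches
```

Each link commits to the previous hash, so editing any earlier record invalidates everything after it.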
Massive update to the interactive showcase, demoing all of this running against live services: https://www.mnemom.ai/showcase <-- all features interactive - no BS.
This is the answer to "who watches the watchmen". More to come.
Q: how is your AAP different from the industry work happening on intent/instructions?