Show HN: Statewright – Visual state machines that make AI agents reliable

126 points by azurewraith

by azurewraith

0 subcomment

Here's a week 2 update. a lot shipped since this post...
OSS: The `engine` and `agent` crates are now fully open (Apache 2.0). only the `gateway` and the various plugins are FSL with a 3-year clock. You can run the full state machine locally, self-hosted, no cloud dependency. The end-to-end workflow works out of the box with ollama and any 13B+ model now.
Multi-agent support: I validated and enhanced plugins for Codex CLI, Oh-My-Codex, and Pi alongside the existing Claude Code plugin. Same gateway, same workflows, different agent frontends. the Pi plugin in particular is interesting because pi's extension API supports things Claude Code doesn't yet... programmatic model switching and per-state tool filtering where the model literally never sees disallowed tools.
Interrupts: Reactive file triggers that force state transitions. model edits a migration file? interrupt fires, pulls it into a review state. it uses the History State pattern to return where it was in the state machine after.
Fork/join: Parallel sub-agent execution. planning state dispatches N implementation branches each in their own worktree, join collects results.
Allowed_commands: Per-state bash command restrictions. testing state can run pytest but not rm -rf. enforced in the hook, not the prompt.
Tangentially, the Forge post this week (https://news.ycombinator.com/item?id=48192383) validated the same thesis from a different angle... structural guardrails on small models outperform unconstrained frontier models. three independent projects converging on "the harness is first-class infrastructure" in roughly two weeks.
Next on the agenda: per-state model routing. use a local 12B for grunt work, route to Opus/GPT-5 for the one call that matters. the cost math is trending towards ~80% reduction on a 6-phase workflow
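
To make the "enforced in the hook, not the prompt" and interrupt ideas above concrete, here is a minimal Python sketch. It is not Statewright's code or API: `State`, `Machine`, `check_command`, `on_file_edit`, and `resume` are hypothetical names, the metacharacter rejection is a deliberate simplification, and the history handling is just one way to model the History State pattern.

```python
import shlex
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Simplification (assumption): reject any command line containing shell
# metacharacters instead of parsing full shell grammar.
SHELL_METACHARACTERS = set(";|&$`<>\n")


@dataclass
class State:
    name: str
    allowed_commands: set = field(default_factory=set)


class Machine:
    def __init__(self, states, initial, interrupts):
        self.states = {s.name: s for s in states}
        self.current = initial
        self.interrupts = interrupts  # glob pattern -> state to force
        self.history = None           # state to return to after an interrupt

    def check_command(self, command_line):
        """Hook-side check: allow only executables listed for the current state."""
        if SHELL_METACHARACTERS & set(command_line):
            return False
        try:
            argv = shlex.split(command_line)
        except ValueError:  # unbalanced quotes and similar
            return False
        return bool(argv) and argv[0] in self.states[self.current].allowed_commands

    def on_file_edit(self, path):
        """Reactive trigger: a matching edit forces a transition and saves history."""
        for pattern, target in self.interrupts.items():
            if fnmatchcase(path, pattern) and self.current != target:
                self.history = self.current
                self.current = target
                return True
        return False

    def resume(self):
        """Return to the state that was active before the interrupt fired."""
        if self.history is not None:
            self.current, self.history = self.history, None


machine = Machine(
    states=[
        State("implementing", {"git", "cargo"}),
        State("testing", {"pytest"}),
        State("review"),
    ],
    initial="testing",
    interrupts={"*/migrations/*": "review"},
)
```

In this sketch the testing state accepts `pytest -q tests/` and rejects `rm -rf build`, and an edit to a path under a `migrations/` directory moves the machine to `review` until `resume()` is called. The decision lives in the harness, so it holds regardless of what the prompt says.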

by embedding-shape

3 subcomments

I wanted to try to reproduce the research results (https://github.com/statewright/statewright#research-results) locally but I wasn't able to find the code for it. Have you published the code for running those somewhere?
The research page (https://statewright.ai/research) mentions a patent, and a "core engine";
> Provisional patent application filed: #64/054,240 (April 30, 2026). 35 claims covering state machine guardrail enforcement for LLM agent tool access. The core engine remains Apache 2.0 open source.
I'm not sure I understand what the "core engine" is if it's not the "state machine guardrail runtime", which is what the patent covers. What parts are the open source parts exactly?
I find the idea really interesting and was nodding along the way as I read what you wrote, makes sense both for the human and the agent, seems like a really nice idea that'd help, but the patent kind of makes me want to run away and not look into it too deeply.

by giancarlostoro

1 subcomments

Interesting. I built a ticketing system similar to Beads, which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower. I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow like a slug. I don't even care if I am waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!
In any case, I'll have to check out Statewright after work ;)

by redhale

1 subcomments

I feel like caching should be mentioned in tradeoffs, right? If you change the tool list frequently, that's a cache bust. In long sessions that seems like it could significantly affect costs.
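
The cache-bust concern can be illustrated with a toy model of prefix caching. This is a sketch under stated assumptions, not a model of any specific provider: it assumes tool definitions are serialized ahead of the conversation, that a request only reuses the longest prefix shared with the previous request, and it uses made-up "tokens" and the hypothetical names `common_prefix_len` and `simulate`.

```python
def common_prefix_len(a, b):
    """Number of leading items that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def simulate(tool_blocks, turns):
    """Return (cached_tokens, total_tokens) over a session.

    Each request is tools + conversation so far + the new turn. Only the
    prefix shared with the previous request counts as a cache hit.
    """
    previous, history = [], []
    cached = total = 0
    for tools, turn in zip(tool_blocks, turns):
        request = tools + history + turn
        cached += common_prefix_len(previous, request)
        total += len(request)
        previous = request
        history = history + turn
    return cached, total


# Illustrative inputs only: two tool lists and four two-token turns.
TOOLS_A = ["tool:edit", "tool:bash", "tool:read"]
TOOLS_B = ["tool:pytest", "tool:read"]
TURNS = [[f"turn{i}:user", f"turn{i}:assistant"] for i in range(4)]

stable = simulate([TOOLS_A] * 4, TURNS)
alternating = simulate([TOOLS_A, TOOLS_B, TOOLS_A, TOOLS_B], TURNS)
```

With these toy inputs the stable tool list reuses 21 of 32 tokens, while the alternating tool list reuses 0 of 30, because the changed tool block sits at the very start of every request. Real costs depend on the provider's cache rules and pricing, which this sketch does not model.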

by tim-projects

1 subcomments

I'm fully convinced that state machines are the key to getting low powered llm models to produce good quality code.

by DeathArrow

2 subcomments

First thought: But why do we need the external statewright.ai API? Why can't we do everything locally?
Second thought: enforcing tools is useful, and I built myself a Pi extension to deny access to particular tools in some workflows.
But we need some way to force agents to obey the rules.
For example, when using Pi I have rules that ask the main agent to dispatch implementer agents in parallel using git worktrees. Sometimes it uses git worktrees, sometimes not.
The thoughts are like this: "the user asked me to use git worktrees so let me start using git worktrees. But wait, the task is simple so maybe I don't need git worktrees..."
If I ask why it didn't follow the rules, it says something like: "The user is right, I should have followed the rules..."

by addaon

1 subcomments

I’ve been using a pattern similar to this with near-frontier models to solve problems harder than coding. Structurally things are even more extreme — no tool calling allowed. Each state gives structured output that the harness then uses to derive the next state and context. So a context in one state may say “you have these lemmas with definition visible, and these by name in other files”; the agent from a certain state can consume the visible lemmas, but can also modify includes to get visibility into and ability to use other lemmas after iteration. So far, seems sane, but haven’t benchmarked on this problem against more free-form solutions.
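
A rough sketch of what such a harness loop might look like, with no tool calling: each state returns structured JSON, and the harness alone decides the next state and context. The state names, the `verdict` and `include` fields, and the `TRANSITIONS` table are all hypothetical and chosen for illustration; they are not taken from the commenter's setup.

```python
import json

# Hypothetical transition table: state -> verdict -> next state.
TRANSITIONS = {
    "select_lemmas": {"ready": "prove", "need_more": "expand_includes"},
    "expand_includes": {"done": "select_lemmas"},
    "prove": {"proved": "finished", "stuck": "select_lemmas"},
}


def step(state, raw_output, context):
    """Derive the next state and context from one structured model output."""
    output = json.loads(raw_output)
    verdict = output.get("verdict")
    if verdict not in TRANSITIONS[state]:
        raise ValueError(f"verdict {verdict!r} is not valid in state {state!r}")

    new_context = dict(context)
    if state == "expand_includes":
        # Only lemmas already known by name may become visible.
        requested = [n for n in output.get("include", []) if n in context["by_name"]]
        new_context["visible"] = sorted(set(context["visible"]) | set(requested))
        new_context["by_name"] = [n for n in context["by_name"] if n not in requested]
    return TRANSITIONS[state][verdict], new_context


context = {"visible": ["lemma_a"], "by_name": ["lemma_b", "lemma_c"]}
state, context = step("select_lemmas", '{"verdict": "need_more"}', context)
state, context = step(
    "expand_includes", '{"verdict": "done", "include": ["lemma_b", "lemma_x"]}', context
)
```

After these two steps the harness is back in `select_lemmas` with `lemma_a` and `lemma_b` visible; the request for the unknown `lemma_x` is silently dropped. The model never acts directly, it only proposes, which is the property described above.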

by 2001zhaozhao

1 subcomments

Interesting.
In your GitHub, the JSON format shown for defining custom workflows is very simple. I wonder if that limits the detail in the state-related instructions and error messages you can send to a model.
For example, in state transitions, does your tool just tell the model something like "you are in 'act' mode and no longer in 'plan' mode, here are your new available tools"? Seems difficult to give it any more informative messages given how simple the workflow definitions are. Likewise when the model attempts to do something that's not supported for tools in the given phase.

by tecoholic

1 subcomments

Very cool idea. I had something vaguely similar in my mind. It's nice to see someone go ahead and implement it. All the Claude Code animations and not knowing what's happening, how long it will take and what will come out is really frustrating me. On top of that there is no way to actually limit the scope of things. Opencode's Plan mode and build mode help a bit.
If a state machine can improve a local LLM to produce better results, it's a welcome addition for tinkerers and solo devs.

by nextaccountic

1 subcomments

In https://github.com/statewright/statewright/blob/main/docs/im...
what's the difference between a "transition" (purple line, not shown in the workflow) as opposed to happy path / failure?

by esafak

2 subcomments

I just have a smart model write a testable phased plan, have a cheaper model implement the phases, and have yet another model review each phase. I don't see the value of adding a Rust state engine. Algorithmically verifiable things can be tests, and more nebulous things (like pattern compliance) need an LLM to do the heavy lifting and can make mistakes, so what does the state engine buy you?

by dlfelps

1 subcomments

How do you force the agent to use an MCP server? From my experience it can be tricky to get an agent to call an MCP server in the first place. This must require something beyond instructions in the CLAUDE.md.

by password4321

1 subcomments

Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.

by miki_tyler

1 subcomments

Very nice project!
Is the editor/composer separate from the runtime?
If I build a workflow in the visual editor, can I use that same flow inside my own app just by using the runtime/engine? Or is it mainly tied to the Statewright platform and Claude Code plugin?
I’m wondering if the runtime can be used as a standalone piece to power apps I build.

by fizza_pizza

1 subcomments

This actually makes a lot of sense. Feels like most people are trying to brute force reliability with bigger models while you’re reducing the problem space instead. “Agents are suggestions, states are laws” is such a good line too.

by jasonli0226

0 subcomment

Cool idea. Any early data on how much this reduces token usage / cost in practice on real workflows?

by prunrCloud

1 subcomments

Really interesting approach. My only concern would be how much flexibility gets lost when workflows become too rigid. Curious how it performs on tasks that require more creative exploration.

by aitchnyu

1 subcomments

My Kilocode has error messages like "you have called edit for a file you have not read". Did you make an evolved version of this?

by dataworth

0 subcomment

Visualizing agentic problem solving is a really cool concept. Feels like something I’ve seen on TV or something before. I like it.

by davidkpiano

1 subcomments

Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!

by chris_st

1 subcomments

Please add support for the Windsurf editor as well. Thanks!

by brainless

2 subcomments

I have to check how you are using state machines but I have also been focused on small models for a while now.
nocodo is one of my product experiments, currently using a 120B model, but I have tested a few agents inside it with 20B models.
I create a bunch of agents, each with very specific goals. Like Project Manager, Backend Engineer, etc.
Each agent gets a very compact list of tools and access to only certain parts of the filesystem or commands.
https://github.com/brainless/nocodo/tree/main/agents/src

by azurewraith

0 subcomment

Hey it's me again. Some things that didn't fit in the README or the original post -- less about features, more about where this goes.
The plan/implement/test workflow is very basic and represents the most common agentic use case. But the state machine pattern applies to any multi-step work where agents are useful but susceptible to death spirals, hallucinations, or other non-deterministic quirkiness. This also enables Claude Desktop and other non-coding agents to perform useful constrained work.
I've been building a content pipeline for tabletop publishing and tested it a bit earlier yesterday. A research phase gathers lore and game details from a compendium, a drafting phase generates structured content including schema-specific JSON validation (so my Lua+LaTeX templates work without iterating). A review gate has me editing content directly (tmux+neovim dialog is great for this). The agent shapes the content, makes sure it conforms to JSON validation and content requirements, then I write it. Before I adapted the state machine to it, the agent tried to do everything all at once — calling multiple agents is sometimes effective but details get lost and you definitely lose visibility in the summarization. The state machine runs everyone serially (for now) but chaining and parallelization are on the roadmap.
While I was working with statewright on a different workflow over the weekend, Claude (as Claude does) attempted to write an intricate bash script to work around a guardrail... and statewright blocked it! I think that was when I knew there was some real power behind what's been built here. Enforcement has to be structural, not advisory.
Also, since this is generally useful for things besides coding, you can start to think about things like SOC 2 change management. Every change needs a plan, a human review gate, audited implementation, pull request, review, human approval, and then finally a human to approve a production deployment. Today teams enforce this with checklists and hope. An agent constrained by a workflow that won't let it deploy without all the prerequisite pieces is enterprise delivery with an auditable paper trail and humans injected for approvals where they need to be - not managing each change's lifecycle.
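
As a rough illustration of the change-management idea, the following Python sketch gates each step on its prerequisites and records every attempt, including blocked ones. The step names mirror the list above, but `PIPELINE`, `HUMAN_GATES`, and `ChangeWorkflow` are hypothetical names, not Statewright's workflow schema, and a real system would authenticate the human rather than trust an `is_human` flag.

```python
PIPELINE = [
    "plan",
    "human_review",
    "implementation",
    "pull_request",
    "code_review",
    "human_approval",
    "deploy_approval",
    "deploy",
]
# Steps that only a human may complete (assumption for this sketch).
HUMAN_GATES = {"human_review", "human_approval", "deploy_approval"}


class ChangeWorkflow:
    def __init__(self):
        self.completed = []
        self.audit_log = []

    def complete(self, step, actor, is_human=False):
        """Complete the next step, or record the blocked attempt and raise."""
        position = len(self.completed)
        expected = PIPELINE[position] if position < len(PIPELINE) else None
        if step != expected:
            self._record(step, actor, f"blocked: expected {expected}")
            raise PermissionError(f"{step!r} attempted before {expected!r}")
        if step in HUMAN_GATES and not is_human:
            self._record(step, actor, "blocked: human required")
            raise PermissionError(f"{step!r} requires a human")
        self._record(step, actor, "ok")
        self.completed.append(step)

    def _record(self, step, actor, outcome):
        # Every attempt lands in the trail, successful or not.
        self.audit_log.append({"step": step, "actor": actor, "outcome": outcome})
```

Calling `complete("deploy", "agent")` right after the plan step raises `PermissionError` and still leaves an entry in `audit_log`, so the paper trail captures the attempted shortcut as well as the approved path.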
The piece I'm most excited about is agent-generated workflows. You solve a problem once and maintain your context, then point the agent at the JSON schema and it creates and uploads a new workflow to statewright automatically that you can use immediately. No fine-tuning, no exhaustive prompt engineering, no dozens of agents... best-fit lightweight guardrails that agents help build themselves, compiling your intent into structure the models can't weasel their way out of. This is a fundamentally different reality than what the current state of the art is practicing. I think that's a big deal.
