This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.
Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:
## Python Code Example

----
ready = runtime.assert_(
    all_of(
        url_contains("checkout"),  # post-condition: the URL reflects the checkout page
        exists("role=button"),     # post-condition: an actionable button is present
    ),
    "checkout_ready",
    required=True,                 # required gate: failure stops the run
)
----
If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.
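For context, the failure path looks roughly like this. A minimal sketch, assuming a `runtime` object with `check`, `screenshot`, and `dom_snapshot` helpers (illustrative names, not the actual API):

----
# Hedged sketch with illustrative names: a failed required gate captures
# artifacts and halts the run instead of letting later steps execute on a
# bad state. check/screenshot/dom_snapshot are assumptions, not the real API.
import json
import pathlib
import time

def gate(runtime, condition, name: str, artifact_dir: str = "artifacts") -> None:
    if runtime.check(condition):  # assumed boolean evaluation of the post-condition
        return
    out = pathlib.Path(artifact_dir) / f"{name}-{int(time.time())}"
    out.mkdir(parents=True, exist_ok=True)
    (out / "screenshot.png").write_bytes(runtime.screenshot())  # assumed helper
    (out / "dom.html").write_text(runtime.dom_snapshot())       # assumed helper
    (out / "step.json").write_text(json.dumps({"step": name, "condition": repr(condition)}))
    raise RuntimeError(f"Post-condition '{name}' failed; artifacts written to {out}")
----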
Happy to answer questions about the design tradeoffs or where this approach falls short.
We've been building agent-based automation and the reliability problem is brutal. An agent can be 95% accurate on each step, but chain ten steps together and you're at 60% success rate. That's not usable.
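The arithmetic behind that compounding, as a quick sanity check:

----
# Per-step accuracy compounds multiplicatively across a chain of steps.
per_step_accuracy = 0.95
steps = 10
print(per_step_accuracy ** steps)  # ~0.599, i.e. roughly 60% end-to-end success
----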
Curious about the failure modes though. What happens when the verification itself is wrong? Like, the cart shows as updated on screen, but the verification layer checks a stale element?
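One way to reduce that particular risk (a sketch under assumptions; `get_cart_count` is a hypothetical accessor, not part of the project): verify against a fresh read of page state taken right before asserting, with a short settle window, rather than a value captured earlier in the step.

----
# Hedged sketch: re-read live state on every attempt instead of trusting a
# cached handle that may predate the UI update.
import time

def verify_cart_updated(get_cart_count, expected: int, timeout_s: float = 5.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_cart_count() == expected:  # fresh read each attempt
            return True
        time.sleep(0.25)                  # give the page time to settle
    return False
----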
* When you "run a WASM pass", how is that generated? Do you use an agent to do the pruning step, or is it deterministic?
* Where do the "deterministic overrides" come from? I assume they are generated by the verifier agent?
What I find most compelling about this approach is the explicit verification layer. Too many browser automation projects fail silently or drift into unexpected states. The Jest-style assertions create a clear contract: either the step definitively succeeded or it didn't, with artifacts for debugging.
This reminds me of property-based testing - instead of hoping the agent "gets it right," you're encoding what success actually looks like.
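To make that analogy concrete, a hedged sketch (illustrative types, not the project's code): the post-condition is a plain predicate over observable page state, stated independently of how the agent got there, much like a property in property-based testing.

----
# Illustrative only: success is a predicate over observed state, not a script step.
from dataclasses import dataclass

@dataclass
class PageState:
    url: str
    cart_count: int

def checkout_ready(state: PageState) -> bool:
    # The property must hold after the step regardless of which path the agent took.
    return "checkout" in state.url and state.cart_count > 0
----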
I think using a logical diff for pass/fail checking is clever, though I wonder about failure modes that could confuse it, such as verifying highly dynamic webpages that change their content even without active user interaction.
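One mitigation for the dynamic-content case (hypothetical illustration, not the project's implementation): filter out fields that are expected to change on every render before computing the diff.

----
# Hypothetical: a "logical diff" that ignores volatile fields so content that
# changes without user interaction (timestamps, ads, session ids) does not
# flip a pass into a fail.
VOLATILE_KEYS = {"timestamp", "session_id", "ad_slot"}

def logical_diff(before: dict, after: dict) -> dict:
    changed = {}
    for key in before.keys() | after.keys():
        if key in VOLATILE_KEYS:
            continue  # expected to differ on every render; not a real change
        if before.get(key) != after.get(key):
            changed[key] = (before.get(key), after.get(key))
    return changed
----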
What exactly is importance ranking? Does the verification layer still exist without this ranking?
1. Planner (write a failing test or tests)
2. Executor (generate a solution)
3. Verifier (until the tests no longer fail)
4. Repeat
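A minimal sketch of that loop, assuming hypothetical `planner`, `executor`, and `verifier` objects (none of these names come from the project):

----
# Illustrative only: plan checks first, then iterate generate -> verify until
# the checks pass or the iteration budget runs out.
def run_loop(task, planner, executor, verifier, max_iterations: int = 5):
    checks = planner.write_failing_checks(task)                # 1. Planner
    for _ in range(max_iterations):
        candidate = executor.generate_solution(task, checks)   # 2. Executor
        failures = verifier.run(checks, candidate)             # 3. Verifier
        if not failures:
            return candidate                                   # checks now pass
        task = executor.incorporate_feedback(task, failures)   # 4. Repeat with feedback
    raise RuntimeError("Checks still failing after max_iterations")
----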