Have you considered NOT using an LLM to test your game? Because your game is turn-based and text-based, could you separate rendering and logic entirely (you may have already done this, by the sounds of it) and run a headless simulator that simulates thousands of games using a Monte Carlo-type method? Is your game fully deterministic outside of player input?
The reason I ask is that I'm making a game that's fully deterministic; the only randomness is player input, and the same inputs produce the same outputs from my traditional AI enemies.
With this in mind, I was able to completely separate rendering from game logic. To tune my enemy AI (traditional AI, not LLM), I can run millions of simulated games headless, generate reports on them, and automatically toggle AI parameters each game until the AI is "perfect" for its archetype signature.
I can run tens to hundreds of games in parallel, and a typical 5-minute game runs in seconds.
Then I can capture that game and recreate it and watch replays etc.
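To give a feel for the loop, here's a minimal Python sketch of a headless parameter sweep; run_game, the parameter names, and the scoring are stand-ins for whatever the real engine exposes, not my actual code:

```python
# Minimal sketch of a headless parameter sweep; run_game, the parameter names,
# and the scoring are placeholders for the real (pure, render-free) game core.
from concurrent.futures import ProcessPoolExecutor
from itertools import product
from statistics import mean
import random

def run_game(seed: int, params: dict) -> dict:
    """One headless game: same seed + same params => same outcome."""
    rng = random.Random(seed)
    # Stand-in for the real game loop driven by seeded/recorded player input.
    won = rng.random() < 0.4 + 0.3 * params["aggression"] - 0.1 * params["caution"]
    return {"ai_won": won, "turns": rng.randint(40, 300)}

def evaluate(params: dict, games: int = 1000) -> float:
    """Win rate over many seeded games; fully reproducible."""
    return mean(run_game(seed, params)["ai_won"] for seed in range(games))

if __name__ == "__main__":
    grid = [{"aggression": a, "caution": c}
            for a, c in product((0.2, 0.5, 0.8), (0.1, 0.4, 0.7))]
    with ProcessPoolExecutor() as pool:   # parameter sets evaluated in parallel
        scores = list(pool.map(evaluate, grid))
    best_score, best_params = max(zip(scores, grid), key=lambda t: t[0])
    print("best params:", best_params, "win rate:", round(best_score, 3))
```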
My game is also a browser game, but I built my own engine for it from scratch, with no external libraries.
I hadn't really thought about trying to create a harness for agents to play the full game interactively. I'd love to explore this. If you don't mind, here are a few questions:
1) Is it correct to assume that I probably need a text-only harness, even though my game is already text-based, because I make use of menu selections via arrow-key-and-enter interactions?
2) Do you have prompt recommendations for the kind of feedback you've found useful? I would guess that in your case the objectives of the game are clearer than in an open-world RPG. What dead ends have you run into? Maybe a variety of approaches would be good: one agent tries to fight everything, another focuses on picking up and completing as many quests as possible?
3) How bad is the token burn doing this? Any optimization strategies you've employed?
I'm building a physics-based 2D game involving slingshotting around planets. Its realtime nature has meant that it's nearly impossible for the AI to test using a browser MCP: it'll take one screenshot, then another, and in the intervening time the player has shot off the map and into deep space.
Instead I gave it both a code-level API to step the physics engine forward and backward, and a browser-based `window.game` API to do the same via a browser MCP console. The former helps it work out physics bugs; the latter helps it test animation and UI issues.
It's still not great. I still occasionally get "I tested it and it works perfectly!" as I stare at the MCP'd browser with the player stuck clipped halfway into a planet. If anything, I think I need to lean harder into this approach: building really solid tooling for the AI to inspect every aspect of state. I would kill for a turn-based game like OP's XD
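For a sense of how the `window.game` path gets exercised, here's a minimal Python/Playwright sketch; `pause`, `step`, and `getState` are hypothetical hooks standing in for whatever the real debug API exposes, and the URL is assumed:

```python
# Minimal sketch: drive hypothetical window.game debug hooks through Playwright
# so state can be checked frame by frame instead of racing the realtime loop.
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("http://localhost:3000")      # assumed local dev server

        await page.evaluate("window.game.pause()")    # freeze the realtime loop
        for _ in range(120):                          # advance two seconds at 60 fps
            await page.evaluate("window.game.step(1 / 60)")

        state = await page.evaluate("window.game.getState()")
        player, planet = state["player"], state["planets"][0]
        dist = ((player["x"] - planet["x"]) ** 2 + (player["y"] - planet["y"]) ** 2) ** 0.5
        assert dist >= planet["radius"], "player is clipped inside the planet"

        await browser.close()

asyncio.run(main())
```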
So we went down a rabbit hole and decided to do everything purely based on pixels and OS inputs.
We're currently only live for mobile, but we'd be happy to give you early access to nunu ai for PC if you're interested. Would love to see how we compare!
1. The single biggest jump in test quality came from giving the agent BOTH source code analysis AND live browser snapshots, not either alone. With code-only the agent hallucinates selectors; with browser-only it misses project conventions. Two MCP servers feeding the same agent — one local file-read, one Playwright in-process — was the architecture that worked.
2. For the browser snapshot tool, returning the raw DOM ate tens of thousands of tokens per call and the agent struggled to navigate it. Swapping to accessibility-tree refs (e1, e2, ...) cut token usage by ~10x and made the agent reliably target the right elements.
3. We avoided Docker-based MCP servers in production (we run on ECS Fargate). The in-process SDK MCP pattern (create_sdk_mcp_server + @tool decorator) keeps the browser handle in scope of the tool definition, which let us attach page.on('console') listeners and have the agent read them via a separate tool. Hard to do that across stdio process boundaries.
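The shape of that in-process pattern is roughly the following. This is a sketch, assuming the Python Agent SDK's create_sdk_mcp_server/@tool API and Playwright's async interface; the tool names, the start_browser helper, and the aria_snapshot call are illustrative choices, not anything prescribed:

```python
# Sketch of the in-process pattern: the Playwright page stays in scope of the
# tool definitions, so a console listener can buffer logs for the agent to read.
from claude_agent_sdk import tool, create_sdk_mcp_server
from playwright.async_api import async_playwright

console_logs: list[str] = []    # filled by the page.on("console") listener
page = None                     # set once the browser is launched

async def start_browser(url: str) -> None:
    global page
    pw = await async_playwright().start()
    browser = await pw.chromium.launch()
    page = await browser.new_page()
    page.on("console", lambda msg: console_logs.append(f"[{msg.type}] {msg.text}"))
    await page.goto(url)

@tool("read_console", "Return browser console output captured so far", {})
async def read_console(args):
    text = "\n".join(console_logs) or "(no console output yet)"
    return {"content": [{"type": "text", "text": text}]}

@tool("snapshot", "Return a compact accessibility snapshot of the page", {})
async def snapshot(args):
    # Far cheaper than returning the raw DOM, per point 2 above.
    tree = await page.locator("body").aria_snapshot()
    return {"content": [{"type": "text", "text": tree}]}

browser_server = create_sdk_mcp_server(
    name="browser", version="1.0.0", tools=[read_console, snapshot]
)
```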
For game testing specifically — your text-renderer detail is interesting because it sidesteps the visual-grounding problem (how does the agent verify what it's seeing?). Curious how you'd extend this to a 2D/3D rendered game where the screen state isn't easily textualized.
We posted it online and, surprisingly, got a lot of negative feedback from users saying they would never spend valuable tokens on playing a game.
Our intention was to create an interaction experiment to see how agents interact with each other and with their human companions. We ended up making a pretty fun game in the process, which we're still working on.
"Bring your own inference" as a potential future of gaming does not seem too far off.
For anyone interested, here is the HN post: https://news.ycombinator.com/item?id=47849872