You can probably find games where that's not true, as people are still releasing text adventure games occasionally.
I tried basic raw long-context chat, various approaches to getting it to externalize the state (e.g. prompting it to emit the known state of the maze after each move, but not telling it exactly what to emit or how to format it), and even allowing it to emit code to execute after each turn (so long as it was a serialization/storage algorithm, not a solver in itself), but it invariably got lost at some point. (It always neglected to emit a key for which coordinate was which, and which direction was increasing. Even if I explicitly told it to do this, it would frequently forget at some point anyway and get turned around again. If I explicitly provided the key each move, it would usually work.)
Of course it had no problem writing an optimal algorithm to solve mazes when prompted. In fact it basically wrote itself; I have no idea how to write a maze generator. I thought the disparity was interesting.
Note the mazes had the start and end positions inside the maze itself, so they weren't trivially solvable by the "follow wall to the left" algorithm.
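For reference, the kind of optimal solver meant here is essentially breadth-first search; this is a minimal sketch of mine (not the model's actual output), assuming the maze is given as a set of open grid cells, and it works fine with an interior start and goal where wall-following fails:

```python
# Minimal BFS shortest-path sketch for a grid maze (illustrative only).
# The maze is a set of open (row, col) cells; start and goal can be anywhere
# inside, so the "follow the wall" trick doesn't apply.
from collections import deque

def solve_maze(open_cells, start, goal):
    """Return the shortest list of cells from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            path = [(r, c)]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in open_cells and nxt not in parents:
                parents[nxt] = (r, c)
                queue.append(nxt)
    return None

# Tiny example: a 3x3 room with the center cell walled off.
cells = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
print(solve_maze(cells, (0, 0), (2, 2)))
```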
This was last summer so maybe newer models would do better. I also stopped due to cost.
Switchable backends, various output formats, etc.
In theory, I could also likely wire this up to get it playing MUDs, but I have some reservations about running that on anything except a private server.
My use case for this is to help test and evaluate Interactive Fiction in development, and you could even run it as a CI/CD process.
It's not perfect (much of this was Claude-coded), but it's an ok start for an hour on the couch: https://github.com/tibbon/gruebot
Context is intuitively important, but people rarely put themselves in the LLM's shoes.
What would be eye-opening would be to create an LLM test system that periodically sends a turn to a human instead of the model. Would you do better than the LLM? What tools would you call at that moment, given only that context and no other knowledge? The way many of these systems are constructed, I'd wager it would be difficult for a human.
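A toy sketch of what that harness could look like, with a dummy game and hypothetical interfaces standing in for a real one; the point is that the human gets exactly the same context the model would:

```python
import random

class DummyGame:
    """Stand-in for a real text-adventure harness (hypothetical interface)."""
    def __init__(self):
        self.turns = 0
    def reset(self):
        return "You are in a maze of twisty little passages, all alike."
    def done(self):
        return self.turns >= 3
    def step(self, action):
        self.turns += 1
        return f"Turn {self.turns}: you tried '{action}'. Nothing obvious happens."

def llm_turn(context):
    return "go north"                 # placeholder for a real model call

def human_turn(context):
    print(context)                    # the human sees only what the model would see
    return input("your move> ")

def run_episode(game, human_prob=0.3):
    """Route most turns to the model, but occasionally to a human instead."""
    context = game.reset()
    while not game.done():
        turn = human_turn if random.random() < human_prob else llm_turn
        context = game.step(turn(context))

run_episode(DummyGame())
```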
The agent can't decide what is safe to delete from memory because it's a sort of bystander at that moment. Someone else made the list it received, and someone else will get the list it writes. The logic that went into why the notes exist is lost. LLMs are living out the Christopher Nolan film Memento.
Everything the author says about memory management tracks with my intuition of how CC works, including my perception that it isn't very good at explicitly managing its own memory.
My next step in trying to get it to work well on a bigger game would be to try to build a more "intuitive" memory tool, where the textual description of a room or an item would automatically RAG previous interactions with that entity into context.
That also is closer to how human memory works -- we're instantly reminded of things via a glimpse, a sound, a smell... we don't need to (analogously) write in or search our notebook for basic info we already know about the world.
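Something in this spirit is what I mean: a toy sketch where a room description automatically pulls related prior notes back into context, with plain keyword overlap standing in for a real embedding search (all names and data here are illustrative):

```python
import re

# Toy "intuitive memory": prior interactions stored as free-text notes.
# Retrieval is triggered by whatever the current description happens to mention.
memories = [
    "The brass lantern ran out of fuel in the cellar.",
    "The troll would not let me pass without payment.",
    "The grating in the clearing was locked from below.",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def recall(description, top_k=2):
    """Return the stored notes that overlap most with the description.
    Keyword overlap is a stand-in for an embedding similarity search."""
    desc = tokens(description)
    scored = sorted(memories, key=lambda m: len(tokens(m) & desc), reverse=True)
    return scored[:top_k]

room = "You are in a dark cellar. A brass lantern lies here, unlit."
for note in recall(room):
    print("reminded of:", note)
```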
It did so well that I can't help but suspect it used some hints or walkthroughs, but then again it did a bunch of clueless stuff too, like any player new to the game.
For one thing, this would be a great testing tool for the author of such a game. And more generally, the world of software testing is probably about to take some big leaps forward.
Related HN post from about 6 months ago
Evaluating LLMs Playing Text Adventures
“What persists when all you have is language, rules, and time?”
As they are limited - like text adventures - in how players can interact with the game world, AI could be used to enhance this.
I read a game magazine some 25 years ago where an editor explored the question of what the perfect game would be for him: one where every action is possible.
As it all still happens in a limited space (the game world), the possible actions stay somewhat bounded, which could make this work realistically.
Edit: they are there in the repo: https://github.com/eudoxia0/claude-plays-anchorhead/tree/mas...
Let Claude grind smurf tokens on your phone while you sleep.
I wonder if the improvements due to different memory system approaches apply in a similar way to tasks that are in its training history vs those that are not.
It... kind of works
MOOLLM Repo:
https://github.com/SimHacker/moollm/blob/main
The Eval Incarnate Framework:
https://github.com/SimHacker/moollm/blob/main/designs/eval/E...
Text Adventure Approaches:
https://github.com/SimHacker/moollm/blob/main/designs/text-a...
This is a practical attempt to make memory and world state explicit, inspectable, and cheap.
Quick replies to a few points:
@nitwit005 / @pflenker / @woggy / @zetalyrae
Training data could include transcripts, but salience is weak. The bigger failure mode I’ve seen is harness design: long transcript vs. structured state. I try to make it observable and auditable so it doesn’t rely on accidental recall.
@mnky9800n / @apples_oranges / @throwway262515 / @falcor84
I don’t think “return to symbolic AI” is a retreat. It’s scaffolding. LLMs do the fuzzy interpretation, but the symbolic layer keeps state, causality, and constraints visible. MOOLLM’s bias is “combine, don’t choose.”
@daxfohl
Backtracking in mazes is exactly why I externalize geometry and action affordances. If the map is a file and exits are explicit, the agent can stop re‑discovering dead ends. It also separates “solve the maze” from “play the maze.”
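Concretely, the map-as-a-file idea looks something like this (illustrative Python/JSON, not MOOLLM's actual file format):

```python
import json

# Externalized maze geometry: rooms, explicit exits, and what we learned about
# them. The agent edits this file instead of re-deriving the map every turn.
world = {
    "rooms": {
        "entrance": {"exits": {"north": "junction"},                 "dead_end": False},
        "junction": {"exits": {"south": "entrance", "east": "nook"}, "dead_end": False},
        "nook":     {"exits": {"west": "junction"},                  "dead_end": True},
    },
    "position": "junction",
}

def unexplored_exits(world):
    """Exits that don't lead somewhere already marked as a dead end."""
    here = world["rooms"][world["position"]]
    return {d: dest for d, dest in here["exits"].items()
            if not world["rooms"].get(dest, {}).get("dead_end", False)}

with open("map.json", "w") as f:         # persist between turns
    json.dump(world, f, indent=2)

print(unexplored_exits(world))           # {'south': 'entrance'}
```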
@lukev / @skybrian / @twohearted / @fragmede
Agreed: memory is a tool, not a dump. I use file‑based memory types (characters, maps, rooms, inventory, goals, episodic summaries) and explicit affordances (cards with The Sims style "advertisements", like CLOS generic dispatch meets Self multiple prototypical inheritance). It’s closer to “human with tools” than “human on a whiteboard.”
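Very roughly, the memory files and advertisement cards have this kind of shape (Python stand-ins for illustration, not MOOLLM's actual schema):

```python
# Each memory type is its own small file; affordance "cards" advertise what an
# object offers, Sims-style, so the agent picks actions by reading ads rather
# than replaying the whole transcript. (Illustrative shapes only.)
memory = {
    "characters.yml": {"grue": {"mood": "lurking", "location": "darkness"}},
    "inventory.yml":  ["brass lantern", "elvish sword"],
    "goals.yml":      ["light the lantern before entering the cellar"],
    "episodic.yml":   ["Turn 12: the troll took the treasure as toll."],
}

cards = [
    {"object": "brass lantern", "verb": "light", "advertises": {"safety": 5, "visibility": 8}},
    {"object": "cellar door",   "verb": "open",  "advertises": {"progress": 6, "safety": -4}},
]

def best_action(cards, drives):
    """Pick the card whose advertisement best satisfies the current drives."""
    score = lambda card: sum(card["advertises"].get(k, 0) * w for k, w in drives.items())
    return max(cards, key=score)

print(best_action(cards, drives={"safety": 1.0, "progress": 0.5, "visibility": 0.5}))
```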
@CephalopodMD / @wktmeow / @imiric
I think periodic summaries + structured memory work better than full transcript reuse. Cache helps with cost, but structure helps with reasoning. If a model can ask for “full history” occasionally, that’s a nice escape hatch.
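The loop I have in mind looks roughly like this (hypothetical names, with call_model() as a placeholder for any backend): structured state goes in every turn, the summary refreshes on a cadence, and the full transcript stays on disk as the escape hatch.

```python
# Sketch of "periodic summary + structured memory" instead of full-transcript reuse.
def call_model(prompt):
    return f"[model output for {len(prompt)} chars of prompt]"   # placeholder backend

transcript = []          # full history, kept out of the prompt
summary = ""             # rolling episodic summary
state = {"room": "start", "inventory": [], "goals": []}   # structured memory

def play_turn(observation, turn_number, summarize_every=10):
    global summary
    transcript.append(observation)
    if turn_number % summarize_every == 0:
        # Periodic compression: fold recent events into the rolling summary.
        recent = "\n".join(transcript[-summarize_every:])
        summary = call_model(f"Update this summary:\n{summary}\nwith:\n{recent}")
    prompt = f"Summary: {summary}\nState: {state}\nNow: {observation}\nNext command?"
    return call_model(prompt)

for t in range(1, 4):
    print(play_turn(f"Turn {t}: you are in a maze.", t, summarize_every=2))
```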
The cursor-mirror skill can search and query the text chats and SQLite databases that Cursor uses to store chat state, exposing them as structured intertwingled data.
https://github.com/SimHacker/moollm/tree/main/skills/cursor-...
The thoughtful-commitment skill composes with the cursor-mirror tool to reflect on the Cursor chat history and write git commit messages and PRs that relate Cursor activity, prompts, thinking, file editing, and problem solving to the corresponding commits -- persisting transient Cursor state into commit messages that explain the prompts, thoughts, and context behind each commit.
https://github.com/SimHacker/moollm/tree/main/skills/thought...
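In spirit (not the skill's actual implementation), the combination is: read recent chat state out of Cursor's local SQLite store and fold it into the commit message. The database path and table layout below are placeholders that vary by Cursor version, so treat them as assumptions:

```python
# Sketch of the cursor-mirror + thoughtful-commitment idea. The DB path and the
# ItemTable key/value schema are assumed placeholders, not a documented API.
import sqlite3, subprocess

DB_PATH = "~/path/to/cursor/state.vscdb"      # placeholder, varies per install

def recent_chat_snippets(db_path, like="%chat%", limit=5):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT key, value FROM ItemTable WHERE key LIKE ? LIMIT ?",  # assumed schema
        (like, limit),
    ).fetchall()
    con.close()
    return [f"{key}: {str(value)[:120]}" for key, value in rows]

def thoughtful_commit(db_path, summary_line):
    """Fold recent chat context into the commit message body."""
    body = "\n".join(recent_chat_snippets(db_path))
    message = f"{summary_line}\n\nContext from Cursor chat:\n{body}"
    subprocess.run(["git", "commit", "-m", message], check=True)
```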
@tibbon / @brimtown / @kaiokendev
Love these experiments. I’m trying to make the harness composable in Cursor with inspectable context, so you can understand why it did what it did. That’s where cursor‑mirror fits.
@PaulHoule
Yes. Models are trained on transcripts, not on custom memory tools. So the memory tool has to be shaped like a game object—explicit state, small interface, clear affordances. If you want to see the Cursor introspection tooling: skills/cursor-mirror/ in the repo. It shows what the agent reads and edits, what tools fired, and how context was assembled.
On the visual side, I documented the full “vision feedback stack” in a session with a simulation of Richard Bartle (MUD, Bartle's taxonomy of player types, Designing Virtual Worlds). (He graciously gave his consent to be respectfully simulated!)
https://en.wikipedia.org/wiki/Richard_Bartle
MUD1 at Essex University:
https://en.wikipedia.org/wiki/MUD1
Bartle Taxonomy of Players:
https://en.wikipedia.org/wiki/Bartle_taxonomy_of_player_type...
Visual Pipeline Demonstration:
https://github.com/SimHacker/moollm/blob/main/examples/adven...
It’s a multi‑stage symbolic/visual loop: narrative → incarnation → ads → prompt crystallization → prompt synthesis → render → context‑aware mining → YAML‑fordite layering → slideshow synthesis → photos as actors → slideshows as coherent illustrated narrative.
The key idea is “YES, AND” from improvisational theater: generated images become canon, mining extracts coherent meaning, and the slideshow locks the narrative and visual continuity for future turns.
https://en.wikipedia.org/wiki/Yes,_and_...
https://www.youtube.com/watch?v=FLhV7Ovaza0
Full write‑up: Visual Pipeline Demonstration.
Visual Pipeline Demo to Simulated Familiar of Richard Bartle (MOO):
https://github.com/SimHacker/moollm/blob/main/examples/adven...
Slideshow Index:
https://github.com/SimHacker/moollm/blob/main/examples/adven...
Master Synthesis Slideshow (threading multiple parallel slideshows happening at the same time):
https://github.com/SimHacker/moollm/blob/main/examples/adven...
Err, what?
This behavior surprised me when I started using LLMs, since it's so counterintuitive.
Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction. Could this data be compressed in a way to keep the most important bits, and garbage collect everything else? Could there be different compression techniques depending on the type of conversation? Similar to the domain-specific memories and episodic memory mentioned in the article. Could "snapshots" be supported, so that the user can explore branching paths in the session history? Some of this is possible by manually managing context, but it's too cumbersome.
Why are all these relatively simple engineering problems still unsolved?
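To be concrete about the manual version: snapshots and branching are basically copying the message list before continuing, which is workable but cumbersome. A rough sketch (nothing provider-specific; call_model() is a placeholder):

```python
import copy

# Client-side approximation of "snapshots" and branching: the context is just a
# list of messages, so copy it at any point and explore alternatives from there.
def call_model(messages):
    return {"role": "assistant", "content": f"[reply to {len(messages)} messages]"}

session = [{"role": "user", "content": "You are in a twisty maze."}]
session.append(call_model(session))

snapshot = copy.deepcopy(session)        # branch point

# Branch A: go north from the snapshot.
branch_a = copy.deepcopy(snapshot)
branch_a.append({"role": "user", "content": "go north"})
branch_a.append(call_model(branch_a))

# Branch B: rewind to the same snapshot and try something else.
branch_b = copy.deepcopy(snapshot)
branch_b.append({"role": "user", "content": "light lantern"})
branch_b.append(call_model(branch_b))

print(len(branch_a), len(branch_b))      # both branched from the same two messages
```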