The problem is the industry's obsession with concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context however it wants; you can do more than just concatenate prompts and LLM outputs. (A drawback: prompt caching won't help much if most of the context window is composed dynamically.)
Coding CLIs, like web chat, work well because the agent can pull information into the session at will (read a file, web search). The pain point is that if you're appending messages to a stream, you're just slowly filling up the context.
The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.
The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.
Let's say that you have two agents running concurrently: A & B. Agent A decides to push a message into the context of agent B. It does that, and the message ends up in B's message list, right at the bottom of the conversation.
The question is, will agent B register that a new message was inserted and will it act on it?
If you run this experiment, you will find that this architecture does not work very well. Messages that are recent but not the very latest have little effect in an interactive session. In other words, agent B will not respond with "and btw, this and that happened" unless perhaps instructed very rigidly, or unless there is some other instrumentation in place.
Your mileage may vary depending on the model.
A better architecture is pull-based: the agent has tools to query for any pending messages. That way, whatever needs to be communicated is immediately visible, because the pulled messages land right at the bottom of the context, where agents actually pay attention.
The agent loop in that case is slightly more rigid, in the sense that it needs to orchestrate and surface information, and there is certainly no one-size-fits-all solution here.
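A minimal sketch of the pull-based idea in Python (the `Mailbox` class and its method names are my own invention, not from any particular framework): other agents push into a per-agent inbox, and the owning agent drains it with a tool call at the top of each loop turn, so new messages always land at the bottom of the context.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    body: str
    ts: float = field(default_factory=time.time)

class Mailbox:
    """Per-agent inbox: other agents push, the owning agent pulls."""

    def __init__(self) -> None:
        self._pending: list[Message] = []

    def push(self, sender: str, body: str) -> None:
        self._pending.append(Message(sender, body))

    def pull(self) -> list[Message]:
        """Tool the agent calls each loop turn: drain pending messages
        so they are appended at the bottom of the context, where the
        model actually attends to them."""
        msgs, self._pending = self._pending, []
        return msgs

# Agent B's loop checks the mailbox before each model call:
inbox = Mailbox()
inbox.push("agent-a", "btw, this and that happened")
for m in inbox.pull():
    print(f"[{m.sender}] {m.body}")  # goes to the end of B's context
```

The same `pull` call doubles as the "remove from context" lever: messages the agent has drained and deemed irrelevant simply never get re-appended.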
I hope this helps. We've learned this the hard way.
This means:
- less and less "man-in-the-loop"
- less and less interaction between LLMs and humans
- more and more automation
- more and more decision-making autonomy for agents
- more and more risk (i.e., LLMs' responsibility)
- less and less human responsibility
Problem:
Tasks that require continuous iteration and shared decision-making with humans have two options:
- either they stall until human input
- or they decide autonomously at our risk
Unfortunately, automation comes at a cost: RISK.
The system I’ve developed for this is open source and detailed at https://airut.org
I still sit and watch my terminals. It's the easiest way to catch problems.
Yes you can - durable objects do exactly what the "Ably pub/sub channel transport" diagram describes. And it's even easier with the cloudflare agents SDK. This article strawmans the capabilities of competing infra.
It works with multiple LLMs. The main downside is that since they go through the API, it gets expensive once the monthly quota runs out. (They claim to resell additional API usage at cost, but that doesn’t seem easy to verify.) I’ve switched to using Sonnet for most things but haven’t experimented with cheaper models yet.
It seems like the big price gap between what the API costs and what you can get via a subscription is really holding things back.
- The agent and all its state stays on a persistent server that saves state on restart
- Just stream the state directly to the client via websockets, or even the entire UI with something like liveview
OpenClaw has already proven this model and I don't see a great reason to try and solve the problem a different way.
https://developers.openai.com/api/docs/guides/websocket-mode
I have been building on it over the past month, keeping WebSocket sessions warm on workers and routing commands through NATS JetStream. This has made running sidecar threads alongside a main thread very simple, since the worker treats them the same way.
I'm kidding of course but feels like the time has come to look closely into Erlang ecosystem and OTP.
There's even an agentic framework for this: https://jido.run/blog/jido-2-0-is-here
If you think about it, OTP makes a lot of sense for always-on, reachable agents. Agents need to talk to external systems all the time: web services, databases, message queues, local tools.
More than a year ago, I had the idea of building a personal AI assistant connected to multiple services (https://github.com/konovalov-nk/synaptra/blob/main/docs/arch...). But I didn't want to build yet another over-engineered k8s setup just to get isolation and separation of concerns.
Over time, I realized OTP was much closer to the model I actually wanted.
Why?
Some services want to run locally: memory, low-latency text-to-speech, private data access. The agent can also run locally while delegating work across supervised processes. Things will fail, and that's fine — Erlang was built around exactly that assumption.
Once you look at agents this way, they indeed look less like chat sessions and more like long-lived, supervised, stateful processes.
In that sense, Erlang really was ahead of its time.
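For illustration, the "let it crash" control flow can be sketched in a few lines of Python. This is a toy stand-in for an OTP supervisor, not how OTP actually works: a real supervisor restarts whole processes with isolated heaps, while this just retries a function in the same address space.

```python
# Toy "let it crash" supervision: run a worker, restart it whenever
# it dies, and give up after too many restarts (OTP's max_restarts
# idea). The supervise() helper is hypothetical, written for this
# sketch only.
def supervise(worker, max_restarts: int = 3):
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            restarts += 1
            print(f"worker crashed ({exc!r}), restart #{restarts}")
            if restarts > max_restarts:
                raise  # escalate, as a supervisor would to its parent
```

The point of the pattern is that the worker carries no recovery logic at all; failure handling lives entirely in the layer above it.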
Once I hashed canonical input JSON, the cache hit rate on real traffic was higher than expected: mid-teens percent once a handful of workers were live. Curious if anyone here has tried cross-agent result sharing without bolting on a full pub/sub layer.
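For reference, the canonical-JSON hashing step can be sketched like this (assuming SHA-256 keys; the `cache_key` name is mine, not from the parent's system):

```python
import hashlib
import json

def cache_key(payload: dict) -> str:
    """Hash a canonical serialization: sorted keys and fixed
    separators mean semantically identical inputs, regardless of key
    order, map to the same cache entry."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order no longer matters:
print(cache_key({"model": "x", "prompt": "hi"}) ==
      cache_key({"prompt": "hi", "model": "x"}))  # True
```

With keys like this, cross-agent sharing can be as simple as a shared key-value store that every worker consults before calling the model.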
Even if I can string it together it's pretty fragile.
That said, I don't really want to solve this with a SaaS. I'm trying really hard to keep external reliance to a minimum (mostly just the LLM endpoint).
I vibe coded a message system where I still have all the chat windows open, but my agents run a command that finishes once a message meant for them comes along, and then they need to start it back up again themselves. I kept it semi-automatic like that because I'm still experimenting with whether this is what I want.
But they get plenty done without me this way.
I don't think it solves the other half of the problem that we've been working on, which is what happens if you were not the one initiating the work, and therefore can't "connect back into a session" since the session was triggered by the agent in the first place.
The only place I use async now is when I'm stepping away and there are a bunch of longer tasks on my plate. So I kick them off and review them whenever I log in next. However, I don't use this pattern all that much, and even then I'm not sure the context switching whenever I get back is really worth it.
Unless agents get more reliable on long-horizon tasks, it seems that async will have limited utility. But I can easily see this going into videos feeding the Twitter AI launch hype train.
As an aside, I've built and deployed a production system in which disconnecting & reconnecting from an in-progress LLM stream works and resumes from wherever the stream currently is, through a combination of redis/valkey & websockets - it's not all that hard, it turns out!
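A toy version of that resume logic, with a plain dict standing in for redis/valkey (the real system presumably uses redis lists, e.g. RPUSH on write and LRANGE on reconnect, plus a websocket transport; the function names here are mine):

```python
# stream_id -> ordered list of chunks received from the LLM so far
streams: dict[str, list[str]] = {}

def append_chunk(stream_id: str, chunk: str) -> None:
    """Server side: append each chunk as it arrives from the LLM."""
    streams.setdefault(stream_id, []).append(chunk)

def resume(stream_id: str, offset: int) -> tuple[list[str], int]:
    """Client side: reconnect with the offset of the last chunk seen
    and receive everything that arrived while disconnected, plus the
    new offset to use next time."""
    chunks = streams.get(stream_id, [])
    return chunks[offset:], len(chunks)

# A client that saw 2 chunks, dropped, and reconnects:
append_chunk("demo", "Hello")
append_chunk("demo", ", ")
append_chunk("demo", "world")  # arrived while the client was offline
missed, new_offset = resume("demo", 2)
print("".join(missed))  # world
```

Because the stream is just an append-only log keyed by stream id, any number of clients can attach or reattach at any offset.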
> So how are folks solving this?
$5 per month dedicated server, SSH, tmux.
Somebody had better standardize that, because otherwise we'll end up with agents sending rich payloads between themselves via Telegram.
Having long-lived requests, where you submit one, get back a request_id, and then poll for its status, is a 20-year-old solved problem.
Why is this such a difficult thing to do in practice for chat apps? Do we need ASI to solve this problem?
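For what it's worth, the whole pattern fits in a few lines (an in-memory dict instead of a real job store; `submit` and `poll` are hypothetical names for this sketch):

```python
import threading
import time
import uuid

# request_id -> {"status": "pending" | "done", "result": ...}
jobs: dict[str, dict] = {}

def submit(task) -> str:
    """Kick off a long-running task; return a request_id immediately."""
    request_id = uuid.uuid4().hex
    jobs[request_id] = {"status": "pending", "result": None}

    def run() -> None:
        jobs[request_id]["result"] = task()
        jobs[request_id]["status"] = "done"  # set result first, then flip

    threading.Thread(target=run, daemon=True).start()
    return request_id

def poll(request_id: str) -> dict:
    """Clients poll until the status flips to 'done'."""
    return jobs[request_id]

rid = submit(lambda: sum(range(1000)))
while poll(rid)["status"] != "done":
    time.sleep(0.01)
print(poll(rid)["result"])  # 499500
```

A production version would persist the jobs table and add expiry, but the contract a chat app needs is exactly this: a durable id you can come back to later.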