"I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way."
This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights plus the lack of long-term memory are among the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).
It emphasizes two things that are both true at once. LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be bridged with unlimited resources/time. And yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future-paradigm AIs.

"One thing I found fascinating about watching Claude play is it wouldn't play around and experiment the way I'd expect a human to. It would stand still trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat. Whereas I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?"
- FiftyTwo[0]
I'm wondering if this is a function of our training methods? They're sufficiently penalised against making "wrong moves" that they don't experiment?

[0]: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
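That "cognitive analogue of boredom" has a well-known counterpart in reinforcement learning: a count-based exploration bonus, where frequently visited states become less attractive over time. A minimal sketch of the idea (the `beta` weight and grid-tile states are illustrative assumptions, not anything Claude actually uses):

```python
from collections import defaultdict
from math import sqrt

def boredom_bonus(visit_counts, state, beta=1.0):
    """Count-based exploration bonus: states visited often become
    'boring' (low bonus), nudging the agent toward novel states."""
    return beta / sqrt(visit_counts[state] + 1)

# Toy walk: an agent oscillating between two tiles watches their bonus
# decay, while an unvisited tile stays attractive.
counts = defaultdict(int)
for _ in range(10):
    counts[(0, 0)] += 1
    counts[(0, 1)] += 1

stuck = boredom_bonus(counts, (0, 0))   # well-trodden tile
novel = boredom_bonus(counts, (5, 5))   # never visited
print(stuck < novel)
```

An agent adding this bonus to its action values would eventually break out of the one-square-up, one-square-down loop described above, because the looped tiles' bonuses shrink toward zero while unexplored tiles keep a full bonus.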
1) give it text data from something that is annoying to copy and paste (e.g. labels off a chart, or logs from a terrible web UI).
2) give it screenshots of bugs, especially UI glitches.
It's extremely good at 1); I can't remember it ever getting one wrong.
On 2) it _really_ struggled until Opus 4.5, almost comically so: I'd post a screenshot and a description of the UI bug, and it would tell me "great, it looks perfect! What next?"
With Opus 4.5 it's not quite as laughably bad, but it still often misses very obvious problems.
It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".
I've come to the conclusion that browser use without vision (e.g. based on the DOM or accessibility trees) is a dead end, simply because "modern" websites tend to require a comical number of tokens to represent. So if this gets very good (close to human level/speed), then we will have basically solved agents being able to browse any website/GUI effectively.
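A back-of-envelope sketch of the claim above. The ~4-characters-per-token ratio and the flat per-screenshot cost are ballpark assumptions, and the markup snippet is a fabricated illustration of framework-generated HTML, not a real site:

```python
def approx_text_tokens(html: str) -> int:
    """Crude heuristic: roughly 4 characters per token of English/markup."""
    return len(html) // 4

# A single framework-generated button can easily carry this much markup
# (utility-class soup, ARIA attributes, inline SVG); illustrative only.
button = (
    '<div class="css-1dbjc4n r-1awozwy r-18u37iz"><div role="button" '
    'tabindex="0" aria-label="Like" data-testid="like" class="css-18t94o4 '
    'r-1777fci r-bt1l66"><svg viewBox="0 0 24 24">...</svg></div></div>'
)

page = button * 2000           # pretend the page has ~2,000 such nodes
dom_tokens = approx_text_tokens(page)

screenshot_tokens = 1600       # assumed flat cost for one screenshot

print(dom_tokens, screenshot_tokens)
```

Under these assumptions the serialized DOM runs to tens of thousands of tokens per page while a screenshot costs a roughly fixed amount, which is the core of the "DOM-based browsing is a dead end" argument.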
/me clicks on the Twitch link, skips to a random time.
The screen shows a Weezing encounter, but the system mistook it for a Grimer.
Not sure if that's Claude or a bug in the glue code.
Maybe not but it sure would be funny.