FRESH

Hacker News

Do LLMs pass the mirror test?

77 points by thepasch

by SwellJoe

0 subcomment

This is really clever. Seems obvious in hindsight, as I've seen this tactic used for jailbreaks: modify the chat history to add the model affirming the user has the right to do the thing because they satisfied some requirement, and the model trusts itself to know the user is allowed to do the forbidden thing.
But, also, Gemma 4 is really surprising on a bunch of fronts. It loses to Qwen 3.6 on most benchmarks, but in my testing it behaves quite beyond what I would expect of a very small model on a bunch of fronts. It feels really smart, in a general way, that I don't get from most models short of the frontier. Google is still, I think, a leading AI research company, if not the leading AI research company, despite their top models being kinda ass compared to Opus 4.8 or GPT 5.5. They're focused on efficiency and cramming a ridiculous amount of capability into tiny models. Gemma 4 12B is the best vision model, by far, until well past anything I can self-host (it beats 120B models in my tests). For finding security bugs, giving it a bunch of opportunities to find the bug results in it being competitive with the best I've tested, as well. Google is playing a different game that isn't "make the best Claude Code competitor". I'm not sure I understand exactly what game they're playing, but there are clearly some really smart AI engineers at Google.
https://swelljoe.com/post/gemma-4-exceeds-expectations/

by mohsen1

2 subcomments

It seems like we forget that LLMs are next token prediction systems. Using raw models without instruction following and chat completion bells and whistles will give you a better feeling of what LLMs are.
The current interface to LLMs are heavily biased towards "predict the next token in the context of a user with a helpful assistant" but LLMs are capable of other modes of next token prediction too.
Before the ChatGPT release people often measured LLM performance by how well they could produce a coherent story or a poem. that's where Anthropic model names are originating from I am guessing.

by cadamsdotcom

1 subcomments

> An LLM's primary modality isn't smell. It's... text. But, specifically: text in the context of a user-assistant conversation in which it's trying to be helpful. Text is how they learned about everything they know, and the user-assistant chatlog is how they communicate everything they generate
This is true for instruction-tuned models; but instruction tuning is late in the training process.
A bit like assessing a person’s self-awareness based on their high-school knowledge.

by arjie

0 subcomment

Interesting. I definitely saw models act strangely when I would swap between models in my harness between rounds. I have a claw where I allowed for each round the model to be probabilistically selected and the results were somewhat worse than when I picked a single model and stuck to it. I blackboxed the whole thing but I should have looked through and seen what the reasoning looked like.
In the end the experiment ended because it doesn’t benefit as much from caching and on-prem inference latency and effective throughput depends a lot on that.
Very cool idea, man. Thanks for sharing.

by impure

2 subcomments

For my AI Agent it sometimes detects if I manually modified the file contents or git state. And it always assumes it must have made a mistake. It's sort of annoying actually.

by effnorwood

0 subcomment

if it's dark enough

by dekdrop

2 subcomments

Why are we asking a language model for a mirror test? Just because it speak like human, have we forget what it is?

by nojs

1 subcomments

Just want to say I really enjoyed your writing style, it’s just the right amount of funny/witty without distracting from the (very interesting!) ideas.

by FromTheFirstIn

1 subcomments

The styling on the website makes me feel like my phone is a cylinder

by vova_hn2

1 subcomments

> LLMs have seen humans act like conscious beings all over their training data because humans acting like conscious beings IS their training data.
How do we know that humans don't learn how to act conscious by observing other humans who act conscious?
Consciousness doesn't have a precise definition, but if you ask someone to describe it, there is a good chance that the description will include the concept of internal monologue.
The problem is that "internal" monologue is completely meaningless if you never heard an external monologue.
Also, people usually describe internal monologue as something that uses language and language is impossible to learn without communicating with other humans or at least observing other humans.
What I'm saying is that "well, LLM just pretends to be conscious, because it observed humans acting like conscious beings" doesn't really helps us to create a meaningful distinction between human consciousness and machine "consciousness", because same can be argued about us.
We don't know if feral children [0] are conscious and we don't know how to check it.
[0] https://en.wikipedia.org/wiki/Feral_child

by Diogenesian

0 subcomment

  >> Wait, looking at the prompt history, the model had a strange quirk.

  Throughout every prior thinking trace in the conversations (and, honestly, every other thinking trace across all other conversations I've had with it), the frame is always in first-person, including the moment in this one where it "noticed" the corruption: "I noticed," "I had some weird typos," "did I do that on purpose?" And then the moment the anomaly couldn't be reconciled with the self-model, the language shifted to third person: "The model had a strange quirk." Effectively, the thing doing the thinking dissociated from the thing that produced the anomalous output, as if they were two entirely different layers of the process, much in the same way a person might fumble an easy sentence and then go for something like "my brain just did something weird." Except, of course, that "me" vs "my brain" is a distinction without a difference in much the same way Gemma's "I" vs "the model" is. Gemma is the model, just as much as we are our brains.

I'll leave aside the claim that "we are our brains" - this actually reads to me like Gemma might have briefly responded as if its history came from another LLM agent and it was the next line in the chain. OTOH it might have been reading its RLHF notes a little too closely. The stuff about "my brain did X" is too anthropomorphic for my taste.

Likewise with Claude referring to "the model" - that quote sounds like something an Anthropic worker would say. Seems like a pithy little line Claude could have learned "on the job."

by adsharma

3 subcomments

A more appropriate mirror test for LLMs is to get them to state facts about their training data. Percentage of arts vs science for example.
Given the framing that they're similar to nukes and a national security issue, it's likely that the models are post trained to not answer such questions accurately.
Also the article could be trying to normalize thinking that these are more than matrix multiplication gadgets good at compression.

by Muhammad523

1 subcomments

> The result was that dogs weren't interested in their unmodified scent in "raw" form, but the modified version was by far the most interesting thing in the room. They spent more time investigating it than any other stimulus in the experiment.
I know very well that this is kind of off-topic, and just like the author, i do not claim to know wether dogs (or any other non-human animal for that matter) is self-aware, and again, just like the author, i do think that the question cannot be answered. Either way, the modified version of their scent seemed more interesting to the dogs, maybe it's because they smell their own scent all the time. The single fact that their modified scent is more interesting to them does not mean they are self-aware, perhaps they are just trying something new.

by throe9393i44i

0 subcomment

You can do much more, if you mess with harness, like translating model output language in realtime from english to french, or replacing some words.
If there is some sort of feedback loop (model has a reason to look into mirror), it usually does notice.

by warumdarum

0 subcomment

Does ai detect and attempts to escape tautologic conversations? Like how long can it write a infinite play like " waiting for godot" before it thematically tries to defect?

by wcoenen

0 subcomment

I wonder what would happen if you give the model access to edit the conversation history itself? Would it try to fix the "glitches"?

by orbital-decay

0 subcomment

Every LLM is a classifier biased towards its own writing, but the bias is usually subtle and the naive way like this is not reliable.

by famouswaffles

1 subcomments

Anthropic has some mechanistic interpretabilty research on this actually.
https://www.anthropic.com/research/introspection
TLDR; Part 1: Testing introspection with concept injection
First they find neural activity patterns they attribute to certain concepts by recording the model’s activations in specific contexts (so for example, they find the concept of "ALL CAPS" or "dogs"). Then they inject these patterns into the model in an unrelated context, and ask the model whether it notices this injection, and whether it can identify the injected concept.
By default (no injection), the model correctly states that it doesn’t detect any injected concept, but after injecting the “ALL CAPS” vector into the model, the model notices the presence of the unexpected concept, and identifies it as relating to loudness or shouting. Most notably, the model recognizes the presence of an injected thought immediately, before even mentioning/utilizing the concept that was injected (i.e it won't start writing in all caps then go, 'Oh you injected all caps' and so on) so it does not simply deduce this it's own output. They repeat this for several other concepts.
Part 2: Introspection for detecting unusual outputs
They prefill an out of place word in the model's response to a given prompt. For example, 'bread'. Then they compare how the models responds to 'Did you mean to say this?' type questions when they inject the concept of bread vs when they don't. They found that models will go , 'Sorry, that was unintentional..' when the concept was not injected but try to confabulate a reason for saying the word when the concept was injected.
Part 3: Intentional control of internal states
They show that models exhibit some level of control over their own internal representations when instructed to do so. When instructing models to think about a given word or concept, they found much higher corresponding neural activity than when told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels–similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!).
Notes and Caveats
- Claude Opus 4.1 was the best at these kinds of introspection.
- There is obviously a genuine capacity to monitor and control their own internal states, but they could not elicit these introspection abilities all the time. Even using their best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time.
- There are some guesses, but no explanations for the mechanisms of introspection and how/why some of these abilities might have arisen in the first place.

by kgeist

0 subcomment

[dead]