by sigmoid10
6 subcomments
- >we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions
This sounds like a mistake. They used (among others) GPT-2, which already has fairly high-dimensional hidden vectors. They also somewhat arbitrarily define a collision as an L2 distance smaller than 10^-6 between two vectors. Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere. Just intuitively, in such a high-dimensional space two random vectors are basically orthogonal, so I would expect the chance of two inputs mapping to the same output under these constraints to be astronomically small (like less than one in 10^10000 or something), even worse than your chances of finding a hash collision in SHA-256. Their claim certainly does not sound like something you could verify by testing a few billion examples, although I'd love to see a detailed calculation. The paper is certainly missing one.
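For what it's worth, here is a rough version of that calculation: a sketch only, assuming GPT-2's 768-dimensional hidden states, the paper's 10^-6 threshold, and two independent directions drawn uniformly on the unit sphere.

```python
import numpy as np
from scipy.special import betaln

d = 768                          # GPT-2 hidden size (assumption)
eps = 1e-6                       # the paper's L2 collision threshold
theta = 2 * np.arcsin(eps / 2)   # angle whose chord length is eps
x = np.sin(theta) ** 2
a, b = (d - 1) / 2, 0.5

# The fraction of the unit sphere within angle theta of a fixed point is
# 0.5 * I_x(a, b) (regularized incomplete beta). For tiny x this is about
# x**a / (a * B(a, b)), so compute its log10 to avoid underflow.
log10_p = a * np.log10(x) - np.log10(a) - betaln(a, b) / np.log(10) - np.log10(2)
print(f"log10(P) ~ {log10_p:.0f}")   # roughly -4600
```

So at GPT-2's width a random pair lands within the threshold with probability around 10^-4600, and since the exponent scales with the hidden dimension, wider models push it far lower still; either way, a few billion tests cannot probe it.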
- I remember hearing an argument once that said LLMs must be capable of learning abstract ideas because the size of their weight model (typically GBs) is so much smaller than the size of their training data (typically TBs or PBs). So either the models are throwing away most of the training data, they are compressing the data beyond the known limits, or they are abstracting the data into more efficient forms. That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I am sure I am not understanding this paper correctly, because it sounds like they are claiming that model weights can be used to reproduce the original input text, which would represent an extraordinary level of text compression.
- I don't like the title of this paper, since most people in this space probably think of language models not as producing a distribution (wrt which they are indeed invertible, which is what the paper claims) but as producing tokens (wrt which they are not invertible [0]).
Also the author contribution statement made me laugh.
[0] https://x.com/GladiaLab/status/1983812121713418606
- It reminded me of "Text embeddings reveal almost as much as text" from 2023 (https://news.ycombinator.com/item?id=37867635) - and yes, they do cite it.
This has huge implications for privacy. There is a common mental model that embedding vectors are like hashes, so you can store them in a database even though you would not store the plain text.
That is an incorrect assumption: a good embedding stores ALL of it, not just the general gist, but dates, names, passwords.
There is an easy fix for that: a random rotation, which preserves all distances.
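A minimal sketch of that rotation trick, assuming NumPy and some made-up 768-dimensional embeddings (nothing here is from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                                   # embedding dimension (assumption)
emb = rng.normal(size=(100, d))           # stand-in for stored embeddings

# Draw a random orthogonal matrix (QR of a Gaussian) and rotate everything.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
rotated = emb @ Q

# All pairwise distances (and dot products) are preserved, so nearest-neighbour
# search still works on the rotated vectors.
orig = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
rot = np.linalg.norm(rotated[:, None] - rotated[None, :], axis=-1)
print(np.allclose(orig, rot))             # True
```

Of course this only helps if the rotation matrix is kept secret from whoever obtains the vectors; with Q in hand, the rotation is trivially undone.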
by CGMthrowaway
1 subcomments
- Summary from the authors:
- Different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space
- Injectivity is not accidental, but a structural property of language models
- Across billions of prompt pairs and several model sizes, we find no collisions: no two prompts are mapped to the same hidden states
- We introduce SipIt, an algorithm that exactly reconstructs the input from hidden states in guaranteed linear time.
- This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable.
- In layman's terms, this seems to mean that given a certain unedited LLM output, plus complete information about the LLM, they can determine what prompt was used to create the output. Except that in practice this works almost never. Am I understanding correctly?
by frumiousirc
2 subcomments
- My understanding is that they claim that for every unique prompt there is a unique final state of the LLM. Isn't that patently false due to the finite state of the LLM and the ability (in principle, at least) to input an arbitrarily large number of unique prompts?
I think their "almost surely" is doing a lot of work.
A more consequential result would give the probability of LLM state collision as a function of the number of unique prompts.
As is, they are telling me that I "almost surely" will not hit the bullseye of a dart board. While likely true, it's not saying much.
But, maybe I misunderstand their conclusion.
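As a back-of-the-envelope pigeonhole check (my own rough numbers, assuming float32 hidden states and GPT-2-sized dimensions, not anything from the paper), the finite-state argument only starts to force exact collisions for very long prompts:

```python
import math

d, bits = 768, 32              # hidden size and float precision (assumptions)
vocab = 50257                  # GPT-2 vocabulary size

log10_states = d * bits * math.log10(2)   # distinct representable hidden states
log10_per_token = math.log10(vocab)       # each extra token multiplies the prompt count

# Pigeonhole: once the number of prompts exceeds the number of representable
# states, exact collisions are unavoidable. That only happens for prompts
# longer than roughly this many tokens:
print(math.ceil(log10_states / log10_per_token))   # ~1574
```

So in float32 the pigeonhole only bites beyond roughly 1,500 tokens, and the theorem is stated for the real-valued map, where no such bound exists; if I read the abstract right, the "almost surely" is over the model's parameters, not over the number of prompts.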
- I think I'm misunderstanding the abstract, but are they trying to say that given an LLM output, they can tell me what the input was? Or given an output AND the intermediate layer weights? If it is the first option, I could use the inputs "Only respond with 'OK'" and "Please only respond with 'OK'", which leads to two inputs producing the same output.
- A few critiques:
- If you have a feature detector function (f(x) = 0 when feature is not present, f(x) = 1 when feature is present) and you train a network to compute f(x), or some subset of the network "decides on its own during training" to compute f(x), doesn't that create a zero set of non-zero measure if training continues long enough?
- What happens when the middle layers are of much lower dimension than the input?
- Real analyticity means infinitely many derivatives (according to Appendix A). Does this mean the results don't apply to functions with corners (e.g. ReLU)? A toy example of that concern is sketched below.
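As a toy, layer-in-isolation illustration (my own, not from the paper): a plain ReLU layer has a genuine corner at zero and can collapse distinct inputs, which is exactly the kind of collision the real-analyticity assumption is meant to rule out almost everywhere.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

# Two different inputs collapse to the same activation once ReLU clips the
# negative coordinate, so a ReLU layer on its own is not injective. Residual
# connections in a real transformer can of course restore injectivity.
x1 = np.array([-1.0, 2.0])
x2 = np.array([-3.0, 2.0])
print(relu(x1), relu(x2))    # both print [0. 2.]
```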
by mattfinlayson
1 subcomments
- Author of related work here. This is very cool! I was hoping they would try to invert layer by layer from the output back to the input, but it seems they do a search process at the input layer instead. They rightly point out that the residual connections make a layer-by-layer approach difficult. I would point out, though, that an RMSNorm layer should be invertible thanks to the epsilon term in the denominator, which can be used to recover the input magnitude; a quick sketch of that inversion follows below.
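To make that concrete, a minimal sketch (my own, assuming unit gain and NumPy, not the author's code): the observed mean of y^2 pins down the normalization factor, so the input comes back exactly.

```python
import numpy as np

EPS = 1e-6

def rmsnorm(x, eps=EPS):
    # Standard RMSNorm without the learnable gain (assumed to be 1 here).
    return x / np.sqrt(np.mean(x**2) + eps)

def invert_rmsnorm(y, eps=EPS):
    # With m = mean(x^2), we have mean(y^2) = m / (m + eps) =: r,
    # hence m + eps = eps / (1 - r) and x = y * sqrt(m + eps).
    r = np.mean(y**2)
    return y * np.sqrt(eps / (1.0 - r))

x = np.random.default_rng(0).normal(size=16)
print(np.allclose(invert_rmsnorm(rmsnorm(x)), x))   # True
```

With a learnable gain the same trick works, as long as the gain is known and nonzero in every coordinate.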
- This claim is so big that it requires theoretical proof; empirical analysis isn't convincing given the size of the claim. Causal inference experts have long known that many inputs can map to the same output (that's why identifying the inputs that actually caused a given output is a never-ending task).
- This is very similar to (and maybe even the same thing as) some recent work, published earlier this year, by the people at Ritual AI on attacking attempts to obfuscate LLM inference. It led to the design of their defense, which involves breaking up the prompt token sequence and handing the pieces to multiple computers, so that no individual machine has access to enough consecutive hidden-layer states.
https://arxiv.org/abs/2505.18332
https://arxiv.org/abs/2507.05228
- The paper looks nice! I think what they found is that you can recover the input sequence by trying every token in the vocabulary and finding the unique one that matches the observed state, doing a forward pass to check each candidate token at a given position. This works because the model encodes the sequence so far in the in-flight residual stream, and their paper shows this encoding is unique. So the prompts 'the cat sat on the mat' and 'the dog sat on the mat' are recoverable as distinct states, with each token leaving its mark in the residual (the exact mechanism is unclear, but it would be shocking if this weren't the case). Roughly along the lines of the sketch below.
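A deliberately naive and very slow version of that idea, purely as an illustration (this is not the paper's SipIt algorithm; the model choice, exhaustive scan, and distance matching are my own assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def hidden(ids):
    # Final-layer hidden states for a single sequence of token ids.
    with torch.no_grad():
        return model(torch.tensor([ids])).last_hidden_state[0]

secret = tok.encode("the cat sat on the mat")
target = hidden(secret)              # what the attacker is assumed to observe

recovered = []
for pos in range(len(secret)):
    # Brute force: pick the vocabulary token whose forward pass best matches
    # the observed hidden state at this position. Injectivity is what
    # guarantees the true token is the unique match.
    best = min(range(tok.vocab_size),
               key=lambda c: torch.dist(hidden(recovered + [c])[pos],
                                        target[pos]).item())
    recovered.append(best)

print(tok.decode(recovered))         # "the cat sat on the mat"
```

The paper's SipIt does this with guarantees and in linear time; the scan above is just the brute-force version of the same idea, with a fresh forward pass per candidate.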
- Quoting Timos Moraitis, a neuromorphic computing PhD:
"For reasons like this, "in-context learning" is not an accurate term for transformers. It's projection and storage, nothing is learnt.
This new paper has attracted a lot of interest, and it's nice that it proves things formally and empirically, but it looks like people are surprised by it, even though it was clear."
https://x.com/timos_m/status/1983625714202010111
- Wait, so is it possible to pass a message using AI, and does this matter?
Like, let's imagine only my friend and I have a particular AI model. I write a prompt and get back only the vectors, no actual output. Then I send my friend those vectors and they use this algorithm to reconstruct my message on their end. Does this method of messaging protect against a MITM attack? Could this be used in cryptography?
- There is actually a good analytical result on how vector similarity can easily fail to recover relevant information: https://arxiv.org/pdf/2403.05440
> For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization.
I am not strong in mathematics, but it looks like the claims of these two papers run opposite to each other.
by realitydrift
0 subcomment
- What this paper suggests is that LLM hidden states actually preserve inputs with high semantic fidelity. If that’s the case, then the real distortion isn’t inside the network, it’s in the optimization trap at the decoding layer, where rich representations get collapsed into outputs that feel synthetic or generic. In other words, the math may be lossless, but the interface is where meaning erodes.
- I'm wondering how this might be summarized in simple terms? It sounds like, after processing some text, the entire prompt is included in the in-memory internal state of the program that's doing inference.
But it seems like it would need to remember the prompt to answer questions about it. How does this interact with the attention mechanism?
- Injective doesn’t mean bijective, and that seems obvious. That is, presumably very many inputs will map to the output “Yes”.
by spacecadet
0 subcomment
- I find this interesting. I have tools that attempt to reverse engineer black box models through auto-prompting and analysis of the outputs/tokens. I have used this to develop prompt injection attacks that "steer" output, but have never tried to use the data to recreate an exact input...
by adamddev1
2 subcomments
- Could this be a way to check for AI plagiarism? Given a chunk of text, would you be able to (almost) prove that it came from a prompt saying "Write me a short essay on ___"?
- Authors: Giorgos Nikolaou‡*, Tommaso Mencattini†‡*,
Donato Crisostomi†, Andrea Santilli†, Yannis Panagakis§¶, Emanuele Rodolà†
†Sapienza University of Rome
‡EPFL
§University of Athens
¶Archimedes RC
*Equal contribution; author order settled via Mario Kart.
- "And hence invertible" <- does every output embedding combination have an associated input ? Are they able to construct it or is this just an existence result ?
- Isn't a requirement for injectivity that distinct inputs map to distinct outputs? Whereas LLMs can produce the same output given multiple different inputs?
- Does that mean that all these embeddings in all those vector databases can be used to extract all these secret original documents?
by acetofenone
1 subcomments
- Actually if you prompt:
Answer to this question with "ok, got it"
Answer:
>Ok, got it
Answer to this question with exactly "ok, got it"
Answer:
>Ok, got it
Hence it is not injective
- Are the weights invertible, or are the prompts being fed into the model invertible?
- I wonder how these pieces of understanding can be applied to neuroscience.
- 42
In less than 7.5 million years please. https://simple.wikipedia.org/wiki/42_(answer)
by WhitneyLand
0 subcomment
- tldr: Seeing what happens internally in an LLM lets you reconstruct the original prompt exactly.
Maybe not surprising if you logged all internal activity, but it can be done from only a single snapshot of hidden activations from the standard forward pass.
by danielmarkbruce
0 subcomment
- tldr ~ for dense decoder-only transformers, the last-token hidden state almost certainly identifies the input, and you can invert it in practice from internal activations.
by fatherrhyme
1 subcomments
- Am I misunderstanding this?
Any stateful system that exposes state in a flexible way carries a risk of data exposure.
Does anyone actually think a stateful system wouldn’t release state?
Why not just write a paper “The sky may usually be blue”?