If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2, that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,888 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
Beyond that, the underlying mathematical observation is genuinely interesting, and it points at something real about how large language models and other AI systems work: high-dimensional data can be projected into much lower-dimensional spaces while approximately preserving its structure, and that is part of what lets these models represent a lot of content efficiently and scale well.
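To make that concrete, here is a minimal sketch (my own, not from the article) of a random Gaussian projection preserving pairwise distances; the point count, dimensions, and tolerance are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_high, d_low = 500, 10_000, 512     # arbitrary sizes, just for illustration
X = rng.normal(size=(n, d_high))        # points in the original high-dimensional space

# Random Gaussian projection, scaled so squared norms are preserved in expectation
P = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
Y = X @ P

def pairwise_dists(A):
    # Euclidean distance between every pair of rows
    sq = np.sum(A**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

mask = ~np.eye(n, dtype=bool)
ratio = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
print(f"distance ratios after projection: {ratio.min():.3f} .. {ratio.max():.3f}")
# Usually within roughly 10-15% of 1.0, even though ~95% of the dimensions were dropped.
```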
It means you get 12,000! (factorial) concepts in the limit case, which is more than enough room to fit a taxonomy.
(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends up able to convert your Bash to Python or do word problems.)
Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.
In fact, they can pack complete poems, with or without typos, and you can ask where in the poem the typo is. That is exactly what happens if you paste one into GPT: somewhere in an internal layer it distinguishes exactly that.
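To get a feel for how roomy these spaces are, here is a small sketch (my own numbers, not the parent's) that samples far more random unit vectors than there are dimensions and checks how close to orthogonal they stay:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 12_288        # roughly the 12k-dimensional space discussed in this thread
n = 20_000        # more vectors than dimensions
V = rng.standard_normal((n, d), dtype=np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit vectors

# Check pairwise cosines on a random subsample (the full n x n matrix is large)
idx = rng.choice(n, size=2_000, replace=False)
C = V[idx] @ V[idx].T
np.fill_diagonal(C, 0.0)
print(f"max |cosine| among sampled pairs: {np.abs(C).max():.3f}")
# Typically around 0.04-0.05: tens of thousands of directions, all nearly
# orthogonal, which is why "one concept per dimension" badly undercounts capacity.
```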
Sometimes these things are patched with cosine distance (or even Pearson correlation); see https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally we wouldn't need to, because the vectors would already occupy the space properly.
I am kind of surprised that the original article does not mention batch normalization and similar operations; these were pretty much created to automatically de-bias and de-correlate values at each layer.
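For reference, a minimal sketch of the standardization step batch norm performs, leaving out the learned scale/shift and the running statistics used at inference time:

```python
import numpy as np

def batchnorm_standardize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization per feature over a batch.

    This is the core of batch norm (without the learned gamma/beta). It
    removes per-feature bias and scale; full de-correlation across features
    would additionally need a whitening transform.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
# A batch of activations with a strong bias and uneven per-feature scales
acts = rng.normal(loc=3.0, scale=[0.1, 5.0, 1.0], size=(64, 3))
out = batchnorm_standardize(acts)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~[0 0 0], ~[1 1 1]
```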
This article uses theory to imply a high bound for semantic capacity in a vector space.
However, this recent article (https://arxiv.org/pdf/2508.21038) empirically characterizes the semantic capacity of embedding vectors, finding inadequate capacity for some use cases.
These two articles seem at odds. Can anyone help put these two findings in context and explain their seeming contradictions?
The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."
This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (Obviously not an exact analogy, since compiler output is generally deterministic.)
This is also present in image models: "edge" + "four corners" is square, etc.
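A toy illustration of that kind of concept arithmetic; the vectors below are made up for the example (a real model learns these directions rather than having them hand-assigned):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 64   # tiny toy space

# Hypothetical atomic "concept" directions; a real model learns these.
concepts = {name: rng.normal(size=dim)
            for name in ["ruler", "male", "female", "king", "queen"]}
# Bake in the structure the analogy assumes: king ~ ruler + male, queen ~ ruler + female
concepts["king"] = concepts["ruler"] + concepts["male"] + 0.1 * rng.normal(size=dim)
concepts["queen"] = concepts["ruler"] + concepts["female"] + 0.1 * rng.normal(size=dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = concepts["ruler"] + concepts["male"]
best = max(concepts, key=lambda name: cosine(query, concepts[name]))
print(best)   # "king" comes out on top, because we built the toy space that way
```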
Johnson-Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space whose dimension depends on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension, not to the number of points.
It’s quite probable that we observe violations of topological features of the original manifold when we use our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can’t keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That’s called “folding” and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
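Plugging rough numbers into those relations makes the trade-off visible; C, d_B, and ε below are made-up values, and this is just arithmetic on the formulas quoted above:

```python
import numpy as np

# Made-up parameters for illustration
C = 1.0        # prefactor in the covering estimate n ~ C * rho**(-d_B)
d_B = 4.0      # box-counting dimension of the attractor
eps = 0.1      # allowed relative distortion of pairwise distances

def k_required(rho):
    """Target dimension suggested by k >~ (d_B / eps**2) * log(C / rho)."""
    return (d_B / eps**2) * np.log(C / rho)

def rho_star(k):
    """Smallest resolvable scale at fixed k: rho* ~ C * exp(-(eps**2 / d_B) * k)."""
    return C * np.exp(-(eps**2 / d_B) * k)

for rho in (1e-1, 1e-2, 1e-3):
    print(f"rho = {rho:g}  ->  k >~ {k_required(rho):,.0f}")
print(f"with k fixed at 1000, rho* ~ {rho_star(1000):.2e}")
# Finer resolution (smaller rho) demands a larger k; hold k fixed and there is
# a scale below which distinct points on the attractor start to collide ("folding").
```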
If someone is bored and would like to discuss this, feel free to email me.
Sure, a 12k-dimensional vector space has a significant number of individual values, but not concepts. This is ridiculous. I mean, Shannon would like to have a word with you.
(where x is a number that depends on architectural features like MLA, GQA, ...)
There is this thing called the KV cache, which holds an enormous latent state.
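For a sense of scale, a back-of-the-envelope KV-cache size calculation; the model shape below is hypothetical (roughly a 70B-class model with grouped-query attention), and real deployments change these numbers a lot with MLA, quantization, etc.:

```python
# Back-of-the-envelope KV cache size:
#   2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per element
# All numbers below are illustrative, not any specific model's published config.
layers = 80
kv_heads = 8          # grouped-query attention: far fewer KV heads than query heads
head_dim = 128
seq_len = 32_768
bytes_per_elem = 2    # fp16 / bf16

cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache for one {seq_len}-token sequence: {cache_bytes / 2**30:.1f} GiB")
# ~10 GiB for a single long sequence; with 64 KV heads (no GQA) it would be ~8x that.
```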
https://lmao.bearblog.dev/exponential-vectors/
For those who are interested in the more "math-y" side of things.
For what it's worth, I don't fully understand the connection between the JL lemma and this "exponentially many vectors" statement, other than the fact that their proof relies on similar concentration behavior.
Now try to separate "learning the language" from "learning the data".
If we have a model pre-trained on language, does it then learn concepts more quickly, at the same rate, or differently?
Can we compress just the data, lossily, into an LLM-like kernel that regenerates the input to a given level of fidelity?
Because there is a large number of combinations of those 12k dimensions? You don’t need a whole dimension for “evil scientist” if you can have a high loading on “evil” and “scientist.” There is quickly a combinatorial explosion of expressible concepts.
I may be missing something but it doesn’t seem like we need any fancy math to resolve this puzzle.
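A sketch of that combinatorial point, using random near-orthogonal directions as stand-in feature vectors; nothing here is learned, it just shows that a sum of a few features stays readable without a dedicated dimension:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
dim, n_features = 12_000, 200            # 200 atomic features in a 12k-dim space
F = rng.normal(size=(n_features, dim))
F /= np.linalg.norm(F, axis=1, keepdims=True)    # near-orthogonal unit directions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "evil scientist" = high loading on "evil" + high loading on "scientist"
evil, scientist, banana = F[0], F[1], F[2]
composite = evil + scientist
for name, f in [("evil", evil), ("scientist", scientist), ("banana", banana)]:
    print(name, round(cosine(composite, f), 3))
# Constituent features score ~0.7, the unrelated one ~0.0: the composite stays
# readable without spending a dedicated dimension on it.

print(f"{comb(n_features, 3):,} possible 3-feature combinations from {n_features} features")
```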
LLMs are designed for Western concepts of attributes, not holistic or Eastern ones. There's not one shred of interdependence; each prediction is decontextualized, and the attempt to reorganize by correction only slightly contextualizes. It's the object/individual illusion in arbitrary words that's meaningless. Anyone studying Gentner, Nisbett, or Halliday can take a look at how LLMs use language to see how vacant they are. This list proves it. LLMs are the equivalent of a circus act using language.
"Let's consider what we mean by "concepts" in an embedding space. Language models don't deal with perfectly orthogonal relationships – real-world concepts exhibit varying degrees of similarity and difference. Consider these examples of words chosen at random: "Archery" shares some semantic space with "precision" and "sport" "Fire" overlaps with both "heat" and "passion" "Gelatinous" relates to physical properties and food textures "Southern-ness" encompasses culture, geography, and dialect "Basketball" connects to both athletics and geometry "Green" spans color perception and environmental consciousness "Altruistic" links moral philosophy with behavioral patterns"