If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2, that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,888 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
Beyond that, the underlying mathematical observation is genuinely interesting, and it points at something real about how large language models and other AI systems work: high-dimensional data can be projected into much lower-dimensional spaces while approximately preserving its structure, and that is part of what lets these models represent a lot of content efficiently and scale well.
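To make that concrete, here is a minimal sketch (my own, not from the article) of a random Gaussian projection preserving pairwise distances; the point count, dimensions, and tolerance are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_high, d_low = 500, 10_000, 512     # arbitrary sizes, just for illustration
X = rng.normal(size=(n, d_high))        # points in the original high-dimensional space

# Random Gaussian projection, scaled so squared norms are preserved in expectation
P = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
Y = X @ P

def pairwise_dists(A):
    # Euclidean distance between every pair of rows
    sq = np.sum(A**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

mask = ~np.eye(n, dtype=bool)
ratio = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
print(f"distance ratios after projection: {ratio.min():.3f} .. {ratio.max():.3f}")
# Usually within roughly 10-15% of 1.0, even though ~95% of the dimensions were dropped.
```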
It means you get 12,000! (factorial) concepts in the limit case, which is more than enough room to fit a taxonomy.
(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends up able to convert your Bash to Python or do word problems.)
Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.
In fact, they can pack complete poems, with or without typos, and you can ask where in the poem the typo is. That is exactly what happens if you paste one into GPT: somewhere in an internal layer it distinguishes exactly that.
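To get a feel for how roomy these spaces are, here is a small sketch (my own numbers, not the parent's) that samples far more random unit vectors than there are dimensions and checks how close to orthogonal they stay:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 12_288        # roughly the 12k-dimensional space discussed in this thread
n = 20_000        # more vectors than dimensions
V = rng.standard_normal((n, d), dtype=np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit vectors

# Check pairwise cosines on a random subsample (the full n x n matrix is large)
idx = rng.choice(n, size=2_000, replace=False)
C = V[idx] @ V[idx].T
np.fill_diagonal(C, 0.0)
print(f"max |cosine| among sampled pairs: {np.abs(C).max():.3f}")
# Typically around 0.04-0.05: tens of thousands of directions, all nearly
# orthogonal, which is why "one concept per dimension" badly undercounts capacity.
```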
Sometimes these things are patched with cosine distance (or even Pearson correlation); see https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally we wouldn't need to, because the vectors would already occupy the space properly.
I am kind of surprised that the original article does not mention batch normalization and similar operations; these were pretty much created to automatically de-bias and de-correlate values at each layer.
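For reference, a minimal sketch of the standardization step batch norm performs, leaving out the learned scale/shift and the running statistics used at inference time:

```python
import numpy as np

def batchnorm_standardize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization per feature over a batch.

    This is the core of batch norm (without the learned gamma/beta). It
    removes per-feature bias and scale; full de-correlation across features
    would additionally need a whitening transform.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
# A batch of activations with a strong bias and uneven per-feature scales
acts = rng.normal(loc=3.0, scale=[0.1, 5.0, 1.0], size=(64, 3))
out = batchnorm_standardize(acts)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~[0 0 0], ~[1 1 1]
```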
This article uses theory to imply a high bound for semantic capacity in a vector space.
However, this recent article (https://arxiv.org/pdf/2508.21038) empirically characterizes the semantic capacity of embedding vectors, finding inadequate capacity for some use cases.
These two articles seem at odds. Can anyone help put these two findings in context and explain their seeming contradictions?
The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."
This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (Obviously not an exact analogy, since compiler output is generally deterministic.)
This is also present in image models: "edge" + "four corners" is square, etc.
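A toy illustration of that kind of concept arithmetic; the vectors below are made up for the example (a real model learns these directions rather than having them hand-assigned):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 64   # tiny toy space

# Hypothetical atomic "concept" directions; a real model learns these.
concepts = {name: rng.normal(size=dim)
            for name in ["ruler", "male", "female", "king", "queen"]}
# Bake in the structure the analogy assumes: king ~ ruler + male, queen ~ ruler + female
concepts["king"] = concepts["ruler"] + concepts["male"] + 0.1 * rng.normal(size=dim)
concepts["queen"] = concepts["ruler"] + concepts["female"] + 0.1 * rng.normal(size=dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = concepts["ruler"] + concepts["male"]
best = max(concepts, key=lambda name: cosine(query, concepts[name]))
print(best)   # "king" comes out on top, because we built the toy space that way
```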
Johnson-Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space whose dimension depends on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension, not to the number of points.
It’s quite probable that we observe violations of topological features of the original manifold when we use our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can’t keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That’s called “folding” and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
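Plugging rough numbers into those relations makes the trade-off visible; C, d_B, and ε below are made-up values, and this is just arithmetic on the formulas quoted above:

```python
import numpy as np

# Made-up parameters for illustration
C = 1.0        # prefactor in the covering estimate n ~ C * rho**(-d_B)
d_B = 4.0      # box-counting dimension of the attractor
eps = 0.1      # allowed relative distortion of pairwise distances

def k_required(rho):
    """Target dimension suggested by k >~ (d_B / eps**2) * log(C / rho)."""
    return (d_B / eps**2) * np.log(C / rho)

def rho_star(k):
    """Smallest resolvable scale at fixed k: rho* ~ C * exp(-(eps**2 / d_B) * k)."""
    return C * np.exp(-(eps**2 / d_B) * k)

for rho in (1e-1, 1e-2, 1e-3):
    print(f"rho = {rho:g}  ->  k >~ {k_required(rho):,.0f}")
print(f"with k fixed at 1000, rho* ~ {rho_star(1000):.2e}")
# Finer resolution (smaller rho) demands a larger k; hold k fixed and there is
# a scale below which distinct points on the attractor start to collide ("folding").
```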
If someone is bored and would like to discuss this, feel free to email me.
Sure, a 12k-dimensional vector space has a significant number of individual values, but not concepts. This is ridiculous. I mean, Shannon would like to have a word with you.
(where x is a number that depends on architectural features like MLA, GQA, ...)
There is this thing called the KV cache, which holds an enormous latent state.
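For a sense of scale, a back-of-the-envelope KV-cache size calculation; the model shape below is hypothetical (roughly a 70B-class model with grouped-query attention), and real deployments change these numbers a lot with MLA, quantization, etc.:

```python
# Back-of-the-envelope KV cache size:
#   2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per element
# All numbers below are illustrative, not any specific model's published config.
layers = 80
kv_heads = 8          # grouped-query attention: far fewer KV heads than query heads
head_dim = 128
seq_len = 32_768
bytes_per_elem = 2    # fp16 / bf16

cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache for one {seq_len}-token sequence: {cache_bytes / 2**30:.1f} GiB")
# ~10 GiB for a single long sequence; with 64 KV heads (no GQA) it would be ~8x that.
```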
https://lmao.bearblog.dev/exponential-vectors/
For those who are interested in the more "math-y" side of things.
For what it's worth, I don't fully understand the connection between the JL lemma and this "exponentially many vectors" statement, other than the fact that their proof relies on similar concentration behavior.
Now try to separate "learning the language" from "learning the data".
If we have a model pre-trained on language, does it then learn concepts more quickly, at the same rate, or differently?
Can we compress just the data, lossily, into an LLM-like kernel that regenerates the input to a given level of fidelity?
Because there is a large number of combinations of those 12k dimensions? You don’t need a whole dimension for “evil scientist” if you can have a high loading on “evil” and “scientist.” There is quickly a combinatorial explosion of expressible concepts.
I may be missing something but it doesn’t seem like we need any fancy math to resolve this puzzle.
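A sketch of that combinatorial point, using random near-orthogonal directions as stand-in feature vectors; nothing here is learned, it just shows that a sum of a few features stays readable without a dedicated dimension:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
dim, n_features = 12_000, 200            # 200 atomic features in a 12k-dim space
F = rng.normal(size=(n_features, dim))
F /= np.linalg.norm(F, axis=1, keepdims=True)    # near-orthogonal unit directions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "evil scientist" = high loading on "evil" + high loading on "scientist"
evil, scientist, banana = F[0], F[1], F[2]
composite = evil + scientist
for name, f in [("evil", evil), ("scientist", scientist), ("banana", banana)]:
    print(name, round(cosine(composite, f), 3))
# Constituent features score ~0.7, the unrelated one ~0.0: the composite stays
# readable without spending a dedicated dimension on it.

print(f"{comb(n_features, 3):,} possible 3-feature combinations from {n_features} features")
```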
LLMs are designed for Western concepts of attributes, not holistic or Eastern ones. There's not one shred of interdependence; each prediction is decontextualized, and the attempt to reorganize by correction only slightly contextualizes. It's the object/individual illusion in arbitrary words that's meaningless. Anyone studying Gentner, Nisbett, or Halliday can take a look at how LLMs use language to see how vacant they are. This list proves it. LLMs are the equivalent of a circus act using language.
"Let's consider what we mean by "concepts" in an embedding space. Language models don't deal with perfectly orthogonal relationships – real-world concepts exhibit varying degrees of similarity and difference. Consider these examples of words chosen at random: "Archery" shares some semantic space with "precision" and "sport" "Fire" overlaps with both "heat" and "passion" "Gelatinous" relates to physical properties and food textures "Southern-ness" encompasses culture, geography, and dialect "Basketball" connects to both athletics and geometry "Green" spans color perception and environmental consciousness "Altruistic" links moral philosophy with behavioral patterns"