I visited one of the models they reference, and Hugging Face flags it as containing malware: https://huggingface.co/lucascruz/CheXpert-ViT-U-MultiClass
> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance
> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model
So they found an underlying commonality among the post-training structures of 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; they demonstrated that in every case this commonality could be substituted for the original models' structures with no loss of function; and they noted that they appear to be the first to discover this.
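To make the mechanics concrete, here's a minimal sketch of what "reconstructing weights by projecting onto a 16-dimensional universal subspace" could look like, using synthetic stand-in data and plain SVD/PCA rather than the authors' actual pipeline:

```python
import numpy as np

# Minimal PCA/SVD sketch of "reconstructing weights from a 16-dimensional
# universal subspace". Everything below is synthetic stand-in data, not the
# authors' models or pipeline.
rng = np.random.default_rng(0)
num_models, p, k = 500, 1024, 16

# Pretend each model's flattened layer weights lie near a shared k-dim subspace.
true_basis = np.linalg.qr(rng.normal(size=(p, k)))[0].T              # (k, p)
weights = rng.normal(size=(num_models, k)) @ true_basis \
          + 0.01 * rng.normal(size=(num_models, p))                  # (num_models, p)

# "Universal subspace" = top-k principal directions across the 500 models.
mean = weights.mean(axis=0)
_, s, vt = np.linalg.svd(weights - mean, full_matrices=False)
basis = vt[:k]
print("variance captured by top-16 directions:", (s[:k]**2).sum() / (s**2).sum())

# An unseen model is then stored as k coefficients instead of p raw weights.
w_new = rng.normal(size=k) @ true_basis + 0.01 * rng.normal(size=p)
coeffs = (w_new - mean) @ basis.T
w_hat = mean + coeffs @ basis
print("relative reconstruction error:",
      np.linalg.norm(w_hat - w_new) / np.linalg.norm(w_new))
```

The reported ~100× memory saving is essentially this trade: one shared basis stored once, plus a handful of coefficients per model, instead of a full weight tensor for every model.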
For a tech analogy, imagine you found a bzip2 dictionary that reduced the size of every compressed file by 99%, because that dictionary turned out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built in, because it would save everyone billions of CPU hours. [*]
[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.
[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].
[***] As a second footnote.
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
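If you want to sanity-check the CNN claim yourself, one rough way (my own sketch, not the paper's methodology) is to compare the principal subspaces of first-layer filters from two different pretrained CNNs via principal angles. Both models below are ImageNet-trained, so this doesn't test the "regardless of data" part, but it shows the kind of measurement involved (assumes torchvision can download the pretrained weights):

```python
import torch
import torchvision.models as tvm

def filter_basis(conv_weight: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Top-k principal directions of a layer's filters, each filter flattened."""
    w = conv_weight.detach().flatten(1)           # (num_filters, in_ch * kh * kw)
    w = w - w.mean(dim=0)
    return torch.linalg.svd(w, full_matrices=False).Vh[:k]

# First conv layers of two different architectures (both 64 filters of 3x7x7).
a = filter_basis(tvm.resnet18(weights="IMAGENET1K_V1").conv1.weight)
b = filter_basis(tvm.resnet50(weights="IMAGENET1K_V1").conv1.weight)

# Cosines of the principal angles between the two 16-dim filter subspaces;
# values near 1 mean the subspaces are largely shared.
print(torch.linalg.svdvals(a @ b.T))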
Here's a very cool analogy from GPT 5.1 which hits the nail on the head in explaining the role of subspaces in learning new tasks, by analogy with 3D graphics.
Think of 3D character animation rigs:
• The mesh has millions of vertices (11M weights).
• Expressions are controlled via:
  • “smile”
  • “frown”
  • “blink”
Each expression is just:
mesh += α_i * basis_expression_i
Hundreds of coefficients modify millions of coordinates.

What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values of the weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication), and all that would do is swap two values in the resulting product; whatever uses that output could undo the effects of the swap, so the whole model has identical behavior, yet you’ve changed the direction of the principal components. Because of that, there can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors.
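The permutation point is easy to make concrete. A toy sketch (mine, not from the paper): permute the hidden units of a two-layer MLP, undo the permutation in the next layer, and the function is unchanged even though the weight matrices, and hence their principal directions, are not:

```python
import numpy as np

# Toy two-layer ReLU MLP illustrating permutation symmetry: permuting hidden
# units (rows of W1/b1) and undoing it downstream (columns of W2) leaves the
# function identical, yet changes the weight matrices' principal directions.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 4
W1, b1 = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
W2 = rng.normal(size=(d_out, d_h))

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

P = np.eye(d_h)[rng.permutation(d_h)]        # random permutation matrix

x = rng.normal(size=d_in)
same = np.allclose(mlp(x, W1, b1, W2), mlp(x, P @ W1, P @ b1, W2 @ P.T))
print("identical behaviour:", same)          # True

# ...but the leading left singular vector of W1 has moved.
u = np.linalg.svd(W1)[0][:, 0]
u_perm = np.linalg.svd(P @ W1)[0][:, 0]
print("same principal direction:", np.allclose(u, u_perm))   # False in general
```

Presumably the shared pre-trained initialisation mentioned above is what keeps these symmetries from scrambling coordinates across the fine-tuned models.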
But I always want Genetic Algorithms to show up in any discussion about neural networks...
This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r ≪ p is the rank of the Hessian at the optimum"
I think this question is super interesting though: why can massively overparametrised models still generalise?
Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious whether this has any consequences for "philosophy of mind"-type stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.
E.g.:
https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY
https://thoughtforms.life/symposium-on-the-platonic-space/
E.g. see this paper on Universal Embeddings: https://arxiv.org/html/2505.12540v2
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
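As a mental model (mine, not that paper's actual method): if you did have paired embeddings from the two models, aligning them would reduce to orthogonal Procrustes, i.e. finding the best rotation between the spaces. The paper's stronger claim is that a shared latent space can be recovered without any pairs:

```python
import numpy as np

# Paired-data baseline for "two models share a latent space": orthogonal
# Procrustes alignment. Synthetic embeddings stand in for two real models;
# the quoted paper's contribution is doing this *without* paired examples.
rng = np.random.default_rng(0)
n, d = 1000, 64

latent = rng.normal(size=(n, d))                      # hypothetical shared representation
Q1 = np.linalg.qr(rng.normal(size=(d, d)))[0]         # model-specific rotations
Q2 = np.linalg.qr(rng.normal(size=(d, d)))[0]
emb_a = latent @ Q1 + 0.01 * rng.normal(size=(n, d))  # "model A" embeddings
emb_b = latent @ Q2 + 0.01 * rng.normal(size=(n, d))  # "model B" embeddings

# Best rotation R minimising ||emb_a @ R - emb_b||_F (closed form via SVD).
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
R = u @ vt
err = np.linalg.norm(emb_a @ R - emb_b) / np.linalg.norm(emb_b)
print("relative alignment error:", err)               # should be small here
```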
Also, from the OP's paper we see this statement:
"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.
First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).
Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).
Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).
If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."
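The first and third points are easy to poke at empirically. A rough stand-in (my sketch, using an RBF kernel rather than an actual NTK): the Gram matrix's eigenvalues decay quickly, so gradient descent under such a kernel effectively learns in only a handful of directions:

```python
import numpy as np

# Eigenvalue decay of a kernel Gram matrix, as a stand-in for the NTK-style
# "learning concentrates in a few dominant directions" argument quoted above.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 10))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * X.shape[1]))              # RBF Gram matrix

eig = np.sort(np.linalg.eigvalsh(K))[::-1]
energy = np.cumsum(eig) / eig.sum()
for k in (1, 4, 16, 64):
    print(f"top-{k:>2} eigendirections carry {energy[k - 1]:.1%} of the kernel's trace")
```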
https://arxiv.org/abs/2007.00810
Without having properly read the linked article: if that's all this is, it's not a particularly new result. Nevertheless, this direction of proofs is imo at the core of understanding neural nets.
> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
I instantly thought of the Muon optimizer, which produces high-rank gradient updates, and of Kimi-K2, which was trained using Muon, yet I see no related references.
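For anyone unfamiliar: the "high-rank" point is that Muon (approximately) orthogonalises the update matrix, which flattens its singular-value spectrum. A rough sketch with a generic cubic Newton-Schulz iteration (not Muon's actual tuned coefficients) and a synthetic low-rank "gradient":

```python
import numpy as np

# Why Muon-style updates are "high rank": orthogonalising a (typically low-rank-ish)
# gradient/momentum matrix pushes all its singular values toward 1. Generic cubic
# Newton-Schulz iteration on synthetic data; not Muon's exact implementation.
rng = np.random.default_rng(0)
G = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 256)) \
    + 0.1 * rng.normal(size=(64, 256))                 # nearly rank-4 "gradient"

X = G / np.linalg.norm(G)          # Frobenius normalisation: spectral norm <= 1,
                                   # inside Newton-Schulz's convergence region
for _ in range(25):
    X = 1.5 * X - 0.5 * X @ X.T @ X                    # step toward the polar factor U @ Vh

print("before:", np.round(np.linalg.svd(G / np.linalg.norm(G), compute_uv=False)[:8], 3))
print("after :", np.round(np.linalg.svd(X, compute_uv=False)[:8], 3))
```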
The 'universal' in the title is not that universal.
Isn't it obvious?
Not a technical person, just trying to put it in other words.
And that: "Defining 'novel' as 'not something that you've said before, even though you're using all the same words, concepts, linguistic tools, etc.' doesn't actually make it 'novel'."
Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!
Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?
That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"
https://grok.com/share/bGVnYWN5_463d51c8-d473-47d6-bb1f-6666...
*Caption for the two images:*
Artistic visualization of the universal low-parameter subspaces discovered in large neural networks (as described in “The Unreasonable Effectiveness of Low-Rank Subspaces,” arXiv:2512.05117).
The bright, sparse linear scaffold in the foreground represents the tiny handful of dominant principal directions (often ≤16 per layer) that capture almost all of the signal variance across hundreds of independently trained models. These directions form a flat, low-rank “skeleton” that is remarkably consistent across architectures, tasks, and random initializations.
The faint, diffuse cloud of connections fading into the dark background symbolizes the astronomically high-dimensional ambient parameter space (billions to trillions of dimensions), almost all of whose directions carry near-zero variance and can be discarded with negligible loss in performance. The sharp spectral decay creates a dramatic “elbow,” leaving trained networks effectively confined to this thin, shared, low-dimensional linear spine floating in an otherwise vast and mostly empty void.