I visited one of the models they reference, and Hugging Face flags it as containing malware: https://huggingface.co/lucascruz/CheXpert-ViT-U-MultiClass
> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance
> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model
So they found an underlying commonality among the post-training structures of 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; they demonstrated that in every case this commonality could be substituted for the original models' structures with no loss of function; and they noted that they appear to be the first to discover this.
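To make the mechanics concrete, here's a minimal sketch of what "reconstructing weights by projecting onto a 16-dimensional universal subspace" could look like, using synthetic stand-in data and plain SVD/PCA rather than the authors' actual pipeline:

```python
import numpy as np

# Minimal PCA/SVD sketch of "reconstructing weights from a 16-dimensional
# universal subspace". Everything below is synthetic stand-in data, not the
# authors' models or pipeline.
rng = np.random.default_rng(0)
num_models, p, k = 500, 1024, 16

# Pretend each model's flattened layer weights lie near a shared k-dim subspace.
true_basis = np.linalg.qr(rng.normal(size=(p, k)))[0].T              # (k, p)
weights = rng.normal(size=(num_models, k)) @ true_basis \
          + 0.01 * rng.normal(size=(num_models, p))                  # (num_models, p)

# "Universal subspace" = top-k principal directions across the 500 models.
mean = weights.mean(axis=0)
_, s, vt = np.linalg.svd(weights - mean, full_matrices=False)
basis = vt[:k]
print("variance captured by top-16 directions:", (s[:k]**2).sum() / (s**2).sum())

# An unseen model is then stored as k coefficients instead of p raw weights.
w_new = rng.normal(size=k) @ true_basis + 0.01 * rng.normal(size=p)
coeffs = (w_new - mean) @ basis.T
w_hat = mean + coeffs @ basis
print("relative reconstruction error:",
      np.linalg.norm(w_hat - w_new) / np.linalg.norm(w_new))
```

The reported ~100× memory saving is essentially this trade: one shared basis stored once, plus a handful of coefficients per model, instead of a full weight tensor for every model.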
For a tech analogy, imagine you found a bzip2 dictionary that reduced the size of every compressed file by 99%, because that dictionary turned out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built in, because it would save everyone billions of CPU hours. [*]
[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.
[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].
[***] As a second footnote.
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
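If you want to sanity-check the CNN claim yourself, one rough way (my own sketch, not the paper's methodology) is to compare the principal subspaces of first-layer filters from two different pretrained CNNs via principal angles. Both models below are ImageNet-trained, so this doesn't test the "regardless of data" part, but it shows the kind of measurement involved (assumes torchvision can download the pretrained weights):

```python
import torch
import torchvision.models as tvm

def filter_basis(conv_weight: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Top-k principal directions of a layer's filters, each filter flattened."""
    w = conv_weight.detach().flatten(1)           # (num_filters, in_ch * kh * kw)
    w = w - w.mean(dim=0)
    return torch.linalg.svd(w, full_matrices=False).Vh[:k]

# First conv layers of two different architectures (both 64 filters of 3x7x7).
a = filter_basis(tvm.resnet18(weights="IMAGENET1K_V1").conv1.weight)
b = filter_basis(tvm.resnet50(weights="IMAGENET1K_V1").conv1.weight)

# Cosines of the principal angles between the two 16-dim filter subspaces;
# values near 1 mean the subspaces are largely shared.
print(torch.linalg.svdvals(a @ b.T))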
Here's a very cool analogy from GPT 5.1 which hits the nail on the head in explaining the role of subspaces in learning new tasks, by analogy with 3D graphics.
Think of 3D character animation rigs:
• The mesh has millions of vertices (11M weights).
• Expressions are controlled via:
  • “smile”
  • “frown”
  • “blink”
Each expression is just:
mesh += α_i * basis_expression_i
Hundreds of coefficients modify millions of coordinates.

What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values of the weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication), and all that would do is swap two values in the resulting product; whatever uses that output could undo the effects of the swap, so the whole model has identical behavior, yet you’ve changed the direction of the principal components. Because of that, there can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors.
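The permutation point is easy to make concrete. A toy sketch (mine, not from the paper): permute the hidden units of a two-layer MLP, undo the permutation in the next layer, and the function is unchanged even though the weight matrices, and hence their principal directions, are not:

```python
import numpy as np

# Toy two-layer ReLU MLP illustrating permutation symmetry: permuting hidden
# units (rows of W1/b1) and undoing it downstream (columns of W2) leaves the
# function identical, yet changes the weight matrices' principal directions.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 4
W1, b1 = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
W2 = rng.normal(size=(d_out, d_h))

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

P = np.eye(d_h)[rng.permutation(d_h)]        # random permutation matrix

x = rng.normal(size=d_in)
same = np.allclose(mlp(x, W1, b1, W2), mlp(x, P @ W1, P @ b1, W2 @ P.T))
print("identical behaviour:", same)          # True

# ...but the leading left singular vector of W1 has moved.
u = np.linalg.svd(W1)[0][:, 0]
u_perm = np.linalg.svd(P @ W1)[0][:, 0]
print("same principal direction:", np.allclose(u, u_perm))   # False in general
```

Presumably the shared pre-trained initialisation mentioned above is what keeps these symmetries from scrambling coordinates across the fine-tuned models.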
But I always want Genetic Algorithms to show up in any discussion about neural networks...
This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r ≪ p is the rank of the Hessian at the optimum"
I think this question is super interesting though: why can massively overparametrised models still generalise?
Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious whether this has any consequences for "philosophy of mind"-type stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.
E.g.:
https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY
https://thoughtforms.life/symposium-on-the-platonic-space/
E.g. see this paper on Universal Embeddings: https://arxiv.org/html/2505.12540v2
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
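As a mental model (mine, not that paper's actual method): if you did have paired embeddings from the two models, aligning them would reduce to orthogonal Procrustes, i.e. finding the best rotation between the spaces. The paper's stronger claim is that a shared latent space can be recovered without any pairs:

```python
import numpy as np

# Paired-data baseline for "two models share a latent space": orthogonal
# Procrustes alignment. Synthetic embeddings stand in for two real models;
# the quoted paper's contribution is doing this *without* paired examples.
rng = np.random.default_rng(0)
n, d = 1000, 64

latent = rng.normal(size=(n, d))                      # hypothetical shared representation
Q1 = np.linalg.qr(rng.normal(size=(d, d)))[0]         # model-specific rotations
Q2 = np.linalg.qr(rng.normal(size=(d, d)))[0]
emb_a = latent @ Q1 + 0.01 * rng.normal(size=(n, d))  # "model A" embeddings
emb_b = latent @ Q2 + 0.01 * rng.normal(size=(n, d))  # "model B" embeddings

# Best rotation R minimising ||emb_a @ R - emb_b||_F (closed form via SVD).
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
R = u @ vt
err = np.linalg.norm(emb_a @ R - emb_b) / np.linalg.norm(emb_b)
print("relative alignment error:", err)               # should be small here
```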
Also, from the OP's paper we see this statement:
"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.
First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).
Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).
Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).
If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."
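The first and third points are easy to poke at empirically. A rough stand-in (my sketch, using an RBF kernel rather than an actual NTK): the Gram matrix's eigenvalues decay quickly, so gradient descent under such a kernel effectively learns in only a handful of directions:

```python
import numpy as np

# Eigenvalue decay of a kernel Gram matrix, as a stand-in for the NTK-style
# "learning concentrates in a few dominant directions" argument quoted above.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 10))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * X.shape[1]))              # RBF Gram matrix

eig = np.sort(np.linalg.eigvalsh(K))[::-1]
energy = np.cumsum(eig) / eig.sum()
for k in (1, 4, 16, 64):
    print(f"top-{k:>2} eigendirections carry {energy[k - 1]:.1%} of the kernel's trace")
```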
https://arxiv.org/abs/2007.00810
Without having properly read the linked article: if that's all this is, it's not a particularly new result. Nevertheless, this direction of proofs is imo at the core of understanding neural nets.
> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
I instantly thought of the Muon optimizer, which produces high-rank gradient updates, and of Kimi-K2, which was trained using Muon, yet I see no related references.
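For anyone unfamiliar: the "high-rank" point is that Muon (approximately) orthogonalises the update matrix, which flattens its singular-value spectrum. A rough sketch with a generic cubic Newton-Schulz iteration (not Muon's actual tuned coefficients) and a synthetic low-rank "gradient":

```python
import numpy as np

# Why Muon-style updates are "high rank": orthogonalising a (typically low-rank-ish)
# gradient/momentum matrix pushes all its singular values toward 1. Generic cubic
# Newton-Schulz iteration on synthetic data; not Muon's exact implementation.
rng = np.random.default_rng(0)
G = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 256)) \
    + 0.1 * rng.normal(size=(64, 256))                 # nearly rank-4 "gradient"

X = G / np.linalg.norm(G)          # Frobenius normalisation: spectral norm <= 1,
                                   # inside Newton-Schulz's convergence region
for _ in range(25):
    X = 1.5 * X - 0.5 * X @ X.T @ X                    # step toward the polar factor U @ Vh

print("before:", np.round(np.linalg.svd(G / np.linalg.norm(G), compute_uv=False)[:8], 3))
print("after :", np.round(np.linalg.svd(X, compute_uv=False)[:8], 3))
```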
The 'universal' in the title is not that universal.
Isn't it obvious?
Not a technical person, just trying to put it in other words.
And that: "Defining 'novel' as 'not something that you've said before, even though you're using all the same words, concepts, linguistic tools, etc.' doesn't actually make it 'novel'."
Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!
Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?
That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"
https://grok.com/share/bGVnYWN5_463d51c8-d473-47d6-bb1f-6666...
*Caption for the two images:*
Artistic visualization of the universal low-parameter subspaces discovered in large neural networks (as described in “The Unreasonable Effectiveness of Low-Rank Subspaces,” arXiv:2512.05117).
The bright, sparse linear scaffold in the foreground represents the tiny handful of dominant principal directions (often ≤16 per layer) that capture almost all of the signal variance across hundreds of independently trained models. These directions form a flat, low-rank “skeleton” that is remarkably consistent across architectures, tasks, and random initializations.
The faint, diffuse cloud of connections fading into the dark background symbolizes the astronomically high-dimensional ambient parameter space (billions to trillions of dimensions), almost all of whose directions carry near-zero variance and can be discarded with negligible loss in performance. The sharp spectral decay creates a dramatic “elbow,” leaving trained networks effectively confined to this thin, shared, low-dimensional linear spine floating in an otherwise vast and mostly empty void.