FRESH

Hacker News

Home

A Theory of Deep Learning

228 points by elonlit

by r0ze-at-hn

0 subcomment

Linking to the paper: https://arxiv.org/pdf/2605.01172 which is also a fantastic read, the application to deep learning is good. It does a lot of cross-mapping and highlighting a bunch of old stuff that is named differently in this paper and worth calling out for those with those backgrounds:
"Cumulative Dissipation Gramian" Ws = Observability Gramian (from Control Theory). For example the spectral cutoff is exactly the Hankel singular value truncation from model reduction.
"Signal Channel" / "Reservoir" is Controllable/Observable vs. Uncontrollable/Unobservable Subspaces. Using Adamjan-Arov-Krein (AAK) theory gives the optimal nonlinear reduced model answering the optimal compression question.
"Drift–Diffusion Separation" is Freidlin-Wentzell Large Deviation Theory. They can predict "grokking" time from the FW action.
"Population-Risk Gate" is Quantum Weak Value / Postselection (Aharonov)
So for the follow-up problems
Control theory gives the truncation error bounds for model compression. Large deviation theory gives the grokking time predictions. Quantum measurement theory gives the imaginary preconditioners. Information geometry gives the optimal continuous relaxation of the gate.
Some nice implications of new ways of doing stuff which are nice to see formalized here:
Old: Pick architecture, hope it generalizes New: Design architecture to maximize observability Gramian rank (Honestly we pull a lot from control theory here)
Old: Use validation set to detect overfitting New: Monitor λ(Ws) spectrum during training; no validation needed
Old: Prune post-hoc based on magnitude New: Prune during training based on ker(Ws) membership
Old: Fixed learning rate New: Spectral learning rate

by ks2048

1 subcomments

The relevant paper: "A Theory of Generalization in Deep Learning". https://arxiv.org/abs/2605.01172

by arolihas

3 subcomments

Idk to me this is just redescribing what deep neural networks do without actually explaining why anything happens. I guess it "unifies" things but I am kinda over most unifying theories. Everything is Bayesian, everything is a graph or a group or some other fancy geometric structure, everything is a category. Ultimately the best framework is whatever is useful enough to explain what's happening in such a way that a practitioner can manipulate the model towards a desired outcome. In other words, where is the knob? The tool they share may be interesting and I hope to play with it to see what happens at different levels of noise applied to the labels.

by prideout

2 subcomments

This is a fascinating mathematical framework, but the post title might be a bit of an overreach. I often wonder if "a theory of deep learning" could exist that could be stated succinctly and that could predict (1) scaling laws and (2) the surprising reliability of gradient descent.
Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.

by smokel

0 subcomment

This essay seems to be related to the paper "There Will Be a Scientific Theory of Deep Learning" [1] which was discussed here recently [2].
[1] https://arxiv.org/pdf/2604.21691
[2] https://news.ycombinator.com/item?id=47893779

by jhanschoo

0 subcomment

My intuitive understanding about double descent is that
1. Older ML models encoded in their architecture and lack of expressivity a bias to simplicity; which aided interpolation.
2. Overparameterized models instead use regularization to nudge parameters to simpler and more robust representations, while still memorizing the noise. In this manner, we still achieve generalization performance OOD. Moreover, the softer nudging and fundamental architectural expressivity allows for "data-specific" generalizations and representations that may be impossible to represent in small models. 3. At the critical point between the two regimes, the model is expressive enough to memorize; but not expressive enough to simultaneously both do that and encode general patterns.
I wonder how this understanding translates to these researchers' models of deep learning.

by kleiba2

0 subcomment

> This exact characterization is possible because in output space, training dynamics can be understood through a locally linear differential equation along the realized path, where dominant eigenmodes of the evolving kernel equilibrate exponentially fast. Forcing an optimizer to slowly step through these solved directions is highly inefficient and suggests a path to analytically jump to the final network state.
But at what computational cost?

by Macuyiko

0 subcomment

Reminded me strongly of the paper "Deep Learning is Not So Mysterious or Different" from a year ago: https://arxiv.org/abs/2503.02113

by hashta

1 subcomments

Interesting read. I remember the grokking paper when it came out but I don't think I've ever seen that classic grokking loss curve in my own hands on real data. Curious if others have seen it more often in practice

by minimaltom

1 subcomments

> That is, if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it. This is a one-line change to Adam that accelerates grokking by 5x, suppresses memorization in PINNs, and improves DPO fine-tuning, eliminating the need for validation sets entirely.
Does anyone understand the formula they expressed above this sentence? is this just the classic "skip updating parameters with high gradient/loss variance in multiple batches/samples" ?

by menno-sh

0 subcomment

Unrelated to the contents, but WOW your blog styling is gorgeous. Incredible

by jdw64

3 subcomments

Does anyone happen to know what font this site is using? It looks really elegant.

by airza

1 subcomments

A very fascinating read.
As a fellow tufte css enjoyer, Why is user select turned off on the sidenotes? I would like to be able to copy paste them quite badly.

by refulgentis

8 subcomments

This is a beautifully written way of saying “Some parts of what the network memorizes affect test behavior, and some don’t.” But that’s not a theory of deep learning, the grand unified theory would explain that.
We're given a signal channel and a reservoir. Signal lives in the channel, noise lives in the reservoir, and the reservoir supposedly doesn’t show up at test time.
Okay, but then we have: why would SGD put the right things in the right bucket?
If the answer is “because the reservoir is defined as the stuff that doesn’t transfer to test,” then this is close to circular.
The Borges/Lavoisier stuff is a tell. "We have unified the field” rhetoric should come after nontrivial predictions and results. Claiming to solve benign overfitting, double descent, grokking, implicit bias, risk of training on population, how to avoid a validation set, and last but not least, skipping training by analytically jumping to the end is 6 theory papers, 3 NeurIPS winners, and a $10B startup. Let's get some results before we tell everyone we unified the field. :) I hope you're right.

by xiaodai

2 subcomments

What a beautifully written article. It's extremely that I favourite an article but this is one.

by hashmap

0 subcomment

this landed precisely on like 3 weird bugs ive been hitting and solving in different stupid ways for dealing with things like sgd collapsing too many good answers into one bad answer, and gave me a real direction to try to fix the link missing in my own ml stuff. what timing. i have tried analytic solutions too and they're useful for like mapping prompts into memory geometry but from there ive ended up still having to use sgd. cause i think what happens is, sgd teaches the neural net both the geometry and how to navigate it. if you just teleport to the answer it doesnt learn how to walk.

0 subcomment

by auggierose

0 subcomment

Looks like a typical machine learning paper to me. It cannot be understood unless you already kind of understand it. That is OK for communication with peers, but eventually I expect a "theory of" to be readable by anyone with a math degree.

0 subcomment

by jeffrallen

0 subcomment

This looks like excellent work, it's reminding me of things I learned from Welch Labs vidoes. Given the amount of time I budget for keeping up with this stuff (regrettably too low) I'll wait until Welch Labs presents something on this.

by vessenes

1 subcomments

So, this is either the paper of the year, or ... definitely not the paper of the year.
https://arxiv.org/pdf/2605.01172 is the current version. The money graphs are page 8 and on where they show (some weirdly thick) line charts with loss results reached in roughly 1/5 the number of steps that Adam takes, just what the blog post mentions.
They also claim holding back test data is not needed, also with more graphs.
I'm not an ML scientist, and I did not attempt to seriously parse the math. It reads to me as something precisely in that liminal space some math papers do where there's enough new terminology that actually parsing through it all is going to take real, concerted effort, possibly with mild brain damage as a risk.
Their 3d graphs of "kernel eigenstructure" also do double duty for me as totally impenetrable and possibly part of an April fool's ML paper that's hilarious to insiders. Or maybe they show something really amazing; they definitely seem to converge into a shape...What does that shape mean??? Why??? What is an eigenstructure? Is it just 3D eigenvectors of some matrices? Is it natural to have a 3D shape representing these large matrices? If not, how and why were these projected down? And why are they different colors in the paper?? You get the feel for my level of understanding.
I think it would frankly just be easier to validate this claim than parse the whole paper. If only I could understand
```
  > Each one-step kernel increment ηKMtSS integrates into WMS , so a sequence of one-step rate-maximizers is the greedy policy whose integral is the signal-channel content of the trajectory through G, exactly as plain SGD is the greedy step whose integral is empirical-risk descent through D. The diagonal cutoff µ2 k >σ2 k/(b−1) is the optimal first-order preconditioner for population risk on any diagonal base, and a streaming variance EMAˆst of squared gradient deviations realizes it as a one-line change to AdamW: one extra parameter-sized state vector and a per parameter gate that multiplies the standard moment update
```
Well enough to implement the one line update to Adam in python. I have not asked codex or claude to assist yet.
Also of note to me, they talk about grokking which I found SUUUPER fascinating when it was first reported, and have never heard about since. So I was really glad to read about it and read that there has been a little academic work on the phenomenon.
Finally, of the three models they repot results on, two are extremely tiny, the last is a DPO round on Qwen 0.5B -- if the code for that is published, I imagine it would be easy to adapt and evaluate in other regimes.

by hiroakiaizawa

0 subcomment

[flagged]

by codemog

0 subcomment

Where’s the theory of how the human brain does what it does? Maybe these high dimensional structures don’t have a nice compact “theory”. Trying to fit these systems into a nice compact theory is a very human thing, but not everything works like that.