- Of all Schmidhuber's credit-attribution grievances, this is the one I am most sympathetic to. If he spent less time remarking on how other people didn't actually invent things (e.g. Hinton and backprop, LeCun and CNNs) or making tenuous arguments that modern techniques are really just instances of ideas he briefly explored decades ago (GANs, attention), and instead focused on how this single line of research (namely, gradient flow and training dynamics in deep neural networks) laid the foundation for modern deep learning, he'd have a much better reputation and probably a Turing Award. That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.
- > Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.
After reading Lang & Witbrock 1988 (https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf), I'm not sure how convincing I find this explanation.
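As a quick sanity check on the weight-1.0 claim, here is a toy Python sketch (mine, not from the quoted post) that isolates the shortcut path and ignores the residual branch entirely: a gradient backpropagated through a stack of shortcuts with fixed weight w gets scaled by w raised to the depth, so anything far from 1.0 vanishes or explodes.

```python
# Toy illustration only: the shortcut path in isolation, residual branch ignored.
# A gradient passed back through `depth` shortcuts of weight w is scaled by w**depth.
depth = 100
for w in (0.9, 1.0, 1.1):
    grad = 1.0
    for _ in range(depth):
        grad *= w  # backprop through one weighted shortcut
    print(f"w = {w}: gradient scaled by {grad:.3e} after {depth} shortcuts")
```

With w = 0.9 the gradient shrinks to roughly 2.7e-05 and with w = 1.1 it grows to roughly 1.4e+04, while w = 1.0 leaves it untouched, which is the property the quote attributes to the 1997 LSTM, the initialized Highway Net, and the ResNet.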
by ekjhgkejhgk
4 subcomments
- I spent some time in academia.
The person with whom an idea ends up associated often isn't the first person to have the idea. Most often it is the person who explains why the idea is important, finds a killer application for it, or otherwise popularizes it.
That said, you can open what Schmidhuber would say is the paper that invented residual NNs. Try to see if you notice anything about the paper that might have hindered the adoption of its ideas [1].
[1] https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdv...
by aDyslecticCrow
0 subcomments
- I thought it was ResNet that invented the technique, but it's interesting to see it rooted back through the LSTM, which feels like a very different architecture. ResNet really made massive waves in the field; for a while it was hard to find a paper that didn't reference it.
- The notion of inventing or creating something in ML doesn't seem very important, as many people can independently come up with the same idea. Conversely, you can produce novel results just by reviewing old literature and demonstrating those ideas in a project.
by ekjhgkejhgk
1 subcomment
- To comment on the substance.
It seems that these two people, Schmidhuber and Hochreiter, were perhaps solving the right problem for the wrong reasons. They thought this was important because they expected that RNNs could hold memory indefinitely. Because of BPTT, you can think of such an RNN as an NN with arbitrarily many layers (the toy sketch after this comment makes the unrolling concrete). At the time, I believe, nobody worried about vanishing gradients in deep NNs, because the compute power for networks that deep just didn't exist. But nowadays that's exactly how their solution is applied.
That's science for you.
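A toy NumPy sketch (my own, with made-up sizes) of that unrolling argument: backpropagating through T timesteps of a linear RNN behaves like backpropagating through T layers that share the recurrent weight matrix W, so the gradient reaching the first timestep is W transposed applied T times and vanishes when the spectral radius of W is below 1.

```python
import numpy as np

# BPTT through a linear RNN h_t = W @ h_{t-1}: T timesteps act like a T-layer
# network with shared weights, so the gradient at t=0 is W.T applied T times
# to the gradient arriving at the last timestep.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 64))   # spectral radius comfortably below 1

for T in (1, 10, 50, 100):
    g = np.ones(64)                        # gradient arriving at the last step
    for _ in range(T):
        g = W.T @ g                        # one BPTT step = one layer of depth
    print(f"T = {T:3d}   ||grad at t=0|| = {np.linalg.norm(g):.3e}")
```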
by jszymborski
0 subcomments
- "LSTMs brought essentially unlimited depth to supervised RNNs"
LSTMs are an incredible architecture; I use them a lot in my research. While LSTMs are useful over many more timesteps than other RNNs, they certainly don't offer 'essentially unlimited depth'.
When training LSTMs whose inputs were sequences of amino acids, which easily top 3,000 timesteps, I ran into huge amounts of instability, with gradients rapidly vanishing. Tokenizing the AAs to get the number of timesteps down to more like 1,500 made things way more stable.
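For what it's worth, the kind of check this suggests is easy to run: a minimal PyTorch sketch (untrained, randomly initialized LSTM and random inputs, so the exact numbers mean little and this is not the commenter's setup) that measures how much gradient from the final output survives back to the very first timestep as the sequence grows.

```python
import torch
import torch.nn as nn

# Untrained LSTM, random inputs: how much of the gradient at the final output
# reaches the first timestep as the sequence length grows?
torch.manual_seed(0)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

for seq_len in (100, 500, 1500, 3000):
    x = torch.randn(1, seq_len, 32, requires_grad=True)
    out, _ = lstm(x)
    out[:, -1].sum().backward()            # loss depends only on the last step
    print(f"T = {seq_len:5d}   ||dL/dx_0|| = {x.grad[:, 0].norm():.3e}")
```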
by jaberjaber23
0 subcomments
- science repeats itself
- I'm not a giant like Schmidhuber so I might be wrong, but imo there are at least two features that set residual connections and LSTMs apart:
1. In LSTMs, skip connections help propagate gradients backwards through time. In ResNets, skip connections help propagate gradients across layers.
2. Forking the dataflow is part of the novelty, not only the residual computation. Shortcuts can contain things like batch norm, downsampling, or any other operation (see the sketch below); LSTM "residual learning" is much more rigid.
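To illustrate point 2, here is a rough PyTorch sketch (names and shapes are my own, not taken from any of the papers) of a ResNet-style block in which the forked shortcut path is not a bare identity but can itself carry a downsampling 1x1 convolution and batch norm.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style block: the shortcut is a forked dataflow that may itself
    contain operations (1x1 downsampling conv + batch norm) when shapes change."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut: downsampling conv + batch norm on the skip path.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()  # plain identity when shapes match

    def forward(self, x):
        return torch.relu(self.main(x) + self.shortcut(x))

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # -> torch.Size([1, 128, 16, 16])
```

Nothing like that operation-carrying fork exists inside an LSTM cell, where the constant-error-carousel self-connection is fixed at weight 1.0, which is the rigidity the comment is pointing at.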
- From the domain, I'm guessing the answer is Schmidhuber.
by HarHarVeryFunny
0 subcomments
- How about Schmidhuber actually invents the next big thing, rather than waiting for it to come along and then claiming credit for it?