I’m trying to turn that into something testable with a simple constraint: “one hobbyist GPU, one day.” If meaningful progress is still possible under tight constraints, it supports the idea that we should invest more in efficiency/architecture/data work, not just bigger runs.
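For a rough sense of what that budget buys, here is a back-of-envelope sketch; the GPU throughput, utilization, and model sizes are my own illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope for the "one hobbyist GPU, one day" budget.
# Peak TFLOPS, utilization, and model sizes are illustrative assumptions.

def token_budget(n_params, tflops_peak=165.0, utilization=0.3, hours=24.0):
    """Tokens trainable within the budget, using the common
    approximation that training costs ~6 * params FLOPs per token."""
    total_flops = tflops_peak * 1e12 * utilization * hours * 3600
    return total_flops / (6 * n_params)

for n in (125e6, 350e6, 1.3e9):
    print(f"{n/1e6:>6.0f}M params -> ~{token_budget(n)/1e9:.2f}B tokens in a day")
```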
My favorite line >> Somewhat humorously, the acceptance that there are emergent properties which appear out of nowhere is another way of saying our scaling laws don’t actually equip us to know what is coming.
Regarding this paragraph >> 3.3 New algorithmic techniques compensate for compute. Progress over the last few years has been as much due to algorithmic improvements as it has been due to compute. This includes extending pre-training with instruction finetuning to teach models instruction following ..., model distillation using synthetic data from larger more performant "teachers" to train highly capable, smaller "students" ..., chain-of-thought reasoning ..., increased context-length ..., retrieval augmented generation ... and preference training to align models with human feedback ...
I would consider algorithmic improvements to be the following: 1. architecture, like RoPE and MLA; 2. efficiency, via custom kernels.
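As a concrete example of the "architecture" bucket, here is a minimal sketch of rotary position embeddings (RoPE); the half-split pairing convention and the base value are just one common choice, so treat the details as illustrative rather than a reference implementation.

```python
import torch

def rope(x, base=10000.0):
    """Minimal rotary position embedding sketch.
    x: (seq_len, dim) with dim even; rotates channel pairs by a
    position-dependent angle."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair frequencies: theta_i = base^(-2i/dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied pair-wise; relative position
    # information falls out of the query/key dot product.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)
print(rope(q).shape)  # torch.Size([16, 64])
```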
The errors in the paper: 1. "Transformers for language modeling (Vaswani et al., 2023)" => this should be 2017.
Disclosure: my proposed experiments: https://ohgodmodels.xyz/
Is this actually accepted? Ever since [0], I thought people recognized that they don't appear out of nowhere.
I especially agree with your point that scaling laws really killed open research. That's a shame, and I personally think we could benefit from more of it.
I originally didn't like calling them scaling laws.
In addition to the law part seeming a bit much, I've found that researchers often overemphasize the scale part. If scaling is predictable, then you don't need to do most experiments at very large scale. However, that doesn't seem to stop researchers from starting there.
Once you find something good, and you understand how it scales, then you can pour system resources into it. So I originally thought it would encourage research. I find it sad that it seems to have had the opposite effect.
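That workflow is basically: fit a power law to a handful of small runs, then extrapolate before committing big compute. A rough sketch, with made-up data points and an assumed irreducible-loss floor:

```python
import numpy as np

# Hypothetical (compute, validation loss) pairs from small-scale runs;
# the numbers are made up for illustration.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
loss = np.array([3.90, 3.55, 3.25, 3.02, 2.84])

irreducible = 1.8  # assumed irreducible-loss floor, also illustrative

# Fit log10(loss - floor) = log10(a) - b * log10(C),
# i.e. the usual scaling-law form L(C) = a * C^(-b) + floor.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss - irreducible), 1)
b, a = -slope, 10.0 ** intercept

def predicted_loss(c):
    return a * c ** (-b) + irreducible

print(f"fitted exponent b ~ {b:.3f}")
print(f"extrapolated loss at 1e21 FLOPs ~ {predicted_loss(1e21):.2f}")
```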
Exactly like semiconductor wafer processing.
Compute is a massive driver for everything in ML: the number of experiments you can run in parallel, how much RL you can try out, how long things take to run, and so on.
ML is pushing scaling along dimensions we haven't pushed before (the number of datacenters, the amount of energy we put into them), and ML is currently seen as the holy grail.
But I'm definitely very curious how this compute and the current progress play out over the next few years. It could be that we hit a hard ceiling where every single percentage point becomes tremendously costly before we reach the level of benchmark achievement that makes all of this usable daily. Or we will see a significant change to our society.
I don't think it will be something in between, tbh, because it definitely feels like we are currently on an exponential progress curve.
"One thing is certain, is the less reliable gains from compute makes our purview as computer scientists interesting again. We can now stray from the beaten path of boring, predictable gains from throwing compute at the problem."
Wasn't it Ilya Sutskever who said some months ago that we were going back to research?
There is absolutely NO reason why that PDF shouldn't load today.
I also feel like most insiders were fully aware of this fact, but it was a neat sales pitch.
If you want to make an existing system more efficient, take away resources.
People spend money on this because it works. It seems odd to call observable reality a "pervasive belief".
> Academia has been marginalized from meaningfully participating in AI progress and industry labs have stopped publishing.
Firstly, I still see news items about new models that are supposed to do more with less. If these are neither from academia nor industry, where are they coming from?
Secondly, "has been marginalized"? Really? Nobody's going to be uninterested in getting better results with less compute spend, attempts have just had limited effectiveness.
> However, it is unclear why we need so many additional weights. What is particularly puzzling is that we also observe that we can get rid of most of these weights after we reach the end of training with minimal loss
I thought the extra weights were because training takes advantage of high-dimensional bullshit to make the math tractable. And that there's some identifiable point where you have "enough" and more doesn't help.
I hadn't heard that anyone had a workable way to remove the extra ones after training, so that's cool.
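One workable (if blunt) version of "remove the extra ones after training" is global magnitude pruning; this PyTorch sketch is only an illustration of that general idea, not whatever specific method the paper has in mind.

```python
import torch
import torch.nn.utils.prune as prune

# A toy model standing in for "a trained network"; in practice you would
# load real trained weights here.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)

# Global magnitude pruning: zero out the 90% of weights with the smallest
# absolute value, pooled across all listed layers.
parameters_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.9
)

zeros = sum((m.weight == 0).sum().item() for m, _ in parameters_to_prune)
total = sum(m.weight.numel() for m, _ in parameters_to_prune)
print(f"zeroed {zeros}/{total} weights ({zeros / total:.0%})")
```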
The impression I had is that there's a somewhat-fuzzy "correct" number of weights and amount of training for any given architecture and data set / information content. And that reaching that point is when you stop getting effort-free results from throwing hardware at the problem.
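That "fuzzy correct number of weights and amount of training" intuition lines up with compute-optimal scaling results. As a rough rule of thumb (my numbers, not the paper's), training cost is about 6·N·D FLOPs and the compute-optimal token count is around 20 tokens per parameter, which gives a quick sizing sketch:

```python
import math

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Rough compute-optimal model/data split from C ~ 6 * N * D and
    D ~ tokens_per_param * N (a Chinchilla-style rule of thumb)."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal(c)
    print(f"C={c:.0e}: ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```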