When a technique or technology is new people are making massive gains by just applying it to some use case, or gathering more data for training, or giving it more resources.
As time goes on those "bitter lesson" gains start to hit the shallow part of the logistic curve and companies have to start investing more and more effort into engineering for each small, incremental gain.
- https://ianbarber.blog/blogroll
- https://ianbarber.blog/archive
- https://ianbarber.blog/posts
- none of the above links work
- i really dont want to scroll 200 pages just to see what your blog articles are
But, I think the underlying problem is that we don't understand how this sh*t works. So, it's just an empirical, iterative mess.
Like physics in the the years shortly before relativity and quantum mechanics.
Nice, hadn't seen this one before.
https://sebastianraschka.com/llm-architecture-gallery/?compa...
If you look at it, the diagrams are very similar, but the main differences are that the feedforward is replaced with a MoE (router to multiple feedforwards) and the model has a different attention implementation.