David looks inside the LLM, finds the "thinking" layers, duplicates them, and splices the copies back to back.
This increases the LLM's scores with basically no overhead.
Very interesting read.
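A minimal sketch of the splice described above, using a plain list as a hypothetical stand-in for a transformer's layer stack (the function name and indices are illustrative, not from the paper):

```python
def duplicate_block(layers, start, end):
    """Return a new layer list with layers[start:end] repeated twice,
    back to back, at its original position."""
    block = layers[start:end]
    return layers[:end] + block + layers[end:]

# Toy stand-in for a decoder stack: labels instead of real layers.
stack = [f"layer_{i}" for i in range(8)]
widened = duplicate_block(stack, 3, 5)
# layers 3 and 4 now run twice in sequence:
# [..., layer_3, layer_4, layer_3, layer_4, layer_5, ...]
```

Since the copies share weights with the originals, the parameter count is unchanged; only a bit of extra forward-pass compute is added, which matches the "basically no overhead" claim.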
I’d guess that the hash function worked better because, by definition, it cannot collapse. A modern MoE training run pays careful attention to expert utilization: you expect some experts to be "hotter" than others (a totally flat usage distribution is a bad sign), but you also watch for unused or radically underutilized experts.
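To make the "cannot collapse" point concrete, here is a sketch of fixed hash routing, where the expert choice depends only on the token and not on any learned parameters (the expert count and function names are assumptions for illustration):

```python
import hashlib

NUM_EXPERTS = 8  # assumed expert count, for illustration only

def hash_route(token_id: int) -> int:
    """Fixed hash routing: the expert is a deterministic function of the
    token, so there is no learned router that can drift toward sending
    everything to a few 'hot' experts."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return digest[0] % NUM_EXPERTS

# A fixed mapping spreads a vocabulary across experts roughly uniformly,
# so no expert ends up unused.
counts = [0] * NUM_EXPERTS
for token_id in range(10_000):
    counts[hash_route(token_id)] += 1
```

The trade-off is that the assignment is content-blind: it guarantees balance but cannot learn which expert is actually best for a given token, which is exactly why learned routers need the utilization monitoring described above.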
Zurada was one of our AI textbooks, and it makes this visual: from a simple classifier all the way to a large language model, we are mathematically creating a shape that the signal interacts with. More parameters mean the shape can be curved in more ways, and more data means the curve becomes higher-definition.
They reach their result empirically, treating the neural network as a black box, when it could arguably be derived mathematically from what we already know.