FRESH

Hacker News

The wall confronting large language models

160 points by PaulHoule

by measurablefunc

16 subcomments

There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding can not be reduced to a probabilistic computation regardless of how large the graph gets it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

by Animats

3 subcomments

That article is weird. They seem obsessed with nuclear reactors. Also, they misunderstand how floating point works.
As one learns at high school, the continuous derivative is the limit of the discrete version as the displacement h is sent to zero. If our computers could afford infinite precision, this statement would be equally good in practice as it is in continuum mathematics. But no computer can afford infinite precision, in fact, the standard double-precision IEEE representation of floating numbers offers an accuracy around the 16th digit, meaning that numbers below 10−16 are basically treated as pure noise. This means that upon sending the displacement h below machine precision, the discrete derivatives start to diverge from the continuum value as roundoff errors then dominate the discretization errors.
Yes, differentiating data has a noise problem. This is where gradient followers sometimes get stuck. A low pass filter can help by smoothing the data so the derivatives are less noisy. But is that relevant to LLMs? A big insight in machine learning optimization was that, in a high dimensional space, there's usually some dimension with a significant signal, which gets you out of local minima. Most machine learning is in high dimensional spaces but with low resolution data points.

by pama

4 subcomments

Sauro, if you read this, please refrain from such low-content speculative statements:
“On a loose but telling note, this is still three decades short of the number of neural connections in the human brain, 1015, and yet they consume some one hundred million times more power (GWatts as compared to the very modest 20 Watts required by our brains).”
No human brain could have time to read all the materials of a modern LLM training run even if they lived and read eight hours a day since humans first appeared over 300,000 years ago. More to the point, inference of an LLM is way more energy efficient than human inference (see the energy costs of a B200 decoding a 671B parameter model and estimate the energy needed to write the equivalent of a human book worth of information as part of a larger batch). The main reason for the large energy costs of inference is that we are serving hundreds of millions of people with the same model. No humans have this type of scaling capability.

by Scene_Cast2

3 subcomments

The paper is hard to read. There is no concrete worked-through example, the prose is over the top, and the equations don't really help. I can't make head or tail of this paper.

by klawed

0 subcomment

> avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.
I wonder if the authors are aware of The Bitter Lesson

by CuriouslyC

1 subcomments

This article is accurate. That's why I'm investigating a bayesian symbolic lisp reasoner. It's incapable of hallucinating, it provides auditable traces which are actual programs and it kicks the crap out of LLMs at stuff like Arc-Agi, symbolic reasoning, logic programs, game playing, etc. I'm working on a paper where I show that the same model can break 80 on arc-agi, run the house by counting cards at blackjack, and solve complex mathematical word problems.

by awanderingmind

0 subcomment

There is a lot of focus in the comments on the authors' credentials and, apparently, their writing style. It is a pity, because I think their discussion of scaling is interesting, even if comparing LLMs to grid-based differential equation solvers might be unconventional (I haven't convinced myself whether the analogy is entirely apt/valid yet, but it could conceivably be).

by phoenixhaber

0 subcomment

I don't get it. Explain in layman's terms please? Without getting involved with the math which looks quite complicated it appears that they simply are assuming scaling with what is currently known without model improvements.

by dcre

2 subcomments

Always fun to see a theoretical argument that something clearly already happening is impossible.

by 18cmdick

0 subcomment