This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.
This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that, the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.
This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.
If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything, so had no reason to "save face" and lie.
This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.
> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position
You can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). You rotate each token's Query and Key vectors only, so dot product can be used to tell how far apart the tokens are when comparing token 1's Query vector to token 2's Key vector.
Positional embedding should just be explained after explaining the Query, Key and Value vectors. When the article explains those only after that, the reader is building up on a wrong intuition and it gets confusing.
I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
I'm a developer but not very good at maths and I still don't understand any of it.
A LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Like recently I waanted a checkbox that has like a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox, and overlay that with a small inner div so you only see the edge that looks like the gradient is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean, is the LLM is able to represent things in space . That part I don't understand.
I also still dont understand the relationship between the chat based LLM and the multi modal stuff. I think I read somewhere when image is generated it is also tokens?
But how does it learn this token-relationship?
All it has is many text samples, but still, nowhere it says how the tokens relate to each other, so where does this information come from?
it goes all over the place.
i'm not actually sure who your target audience is.
there's too many side tangents.
just like, structure it plz.
1. customer feels bad cuz they don't understand how llms work
2. provide high level abstracted explanation (don't dive into concepts yet)
3. provide breakdown guide of overall set of components.
4. walk through each component. don't side track. no need to explain, ROPE,GQA etc... it just distracts.
i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.
at a high level llms take in words, do some math on them, and then produce words, one by one.
inside llms have these different components. we walk through them step by step.
1. tokenizer
2. embedding
3. attention
4. heads
5. ffn
6. sampling
## tokenizer
Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".
Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.
In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.