FRESH

Hacker News

Home

Ask HN: How are Markov chains so different from tiny LLMs?

200 points by JPLeRouzic

by Sohcahtoa82

11 subcomments

A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.
But then, Markov Chains fall apart when the source material is very large. Try training a chain based on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of turning into just simple regurgitation.
In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov Chains are discrete in their next token prediction while an LLM is more fuzzy. LLMs have latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov Chains will ONLY generate sequences that existed in the source.
This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

by kleiba

5 subcomments

Markov chains of order n are essentially n-gram models - and this is what language models used to be for a very long time. They are quite good. As a matter of fact, they were so good that more sophisticated models often couldn't beat them.
But then came deep-learning models - think transformers. Here, you don't represent your inputs and states discretely but you have a representation in a higher-dimensional space that aims at preserving some sort of "semantics": proximity in that space means proximity in meaning. This allows to capture nuances much more finely than it is possible with sequences of symbols from a set.
Take this example: you're given a sequence of n words and are to predict a good word to follow that sequence. That's the thing that LM's do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probabilty tables. So what you do is smoothing: you take away some of the probability mass that you have assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, but there are multiple approaches.
With NN-based LLMs, you don't have that exact same issue: even if you have never seen that n-word sequence in training, it will get mapped into your high-dimensional space. And from there you'll get a distribution that tells you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions.
But for n-grams, just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.

by lukev

0 subcomment

Other comments in this thread do a good job explaining the differences in the Markov algorithm vs the transformer algorithm that LLMs use.
I think it's worth mentioning that you have indeed identified a similarity, in that both LLMs and Markov chain generators have the same algorithm structure: autoregressive next-token generation.
Understanding Markov chain generators is actually a really really good step towards understanding how LLMs work, overall, and I think its a really good pedagogical tool.
Once you understand Markov generating, doing a bit of handwaving to say "and LLMs are just like this except with a more sophisticated statistical approach" has the benefit of being true, demystifying LLMs, and also preserving a healthy respect for just how powerful that statistical model can be.

by MarkusQ

0 subcomment

LLMs include mechanisms (notably, attention) that allow longer-distance correlations than you could get with a similarly-sized Markov chain. If you squint hard enough though, they are Markov chains with this "one weird trick" that makes them much more effective for their size.

by inciampati

2 subcomments

Markov chains have exponential falloff in correlations between tokens over time. That's dramatically different than real text which contains extremely long range correlations. They simply can't model long range correlations. As such, they can't be guided. They can memorize, but not generalize.

by yobbo

0 subcomment

A hidden markov model (HMM) is theoretically capable of modelling text just as well any transformer. Typically, HMMs are probability distributions over a hidden discrete state space, but the distribution and state space can be anything. The size of the state space and transition function determines its capacity. RNNs are effectively HMMs, and recent ones like "Mamba" and so on are considered competent.
Transformers can be interpreted as tricks that recreate the state as a function of the context window.
I don't recall reading about attempts to train very large discrete (million states) HMMs on modern text tokens.

by qoez

0 subcomment

From a core openai insider who have likely trained very large markov models and large transformers: https://x.com/unixpickle/status/1935011817777942952
Untwittered: A Markov model and a transformer can both achieve the same loss on the training set. But only the transformer is smart enough to be useful for other tasks. This invalidates the claim that "all transformers are doing is memorizing their training data".

by Anon84

1 subcomments

Not that different. In fact, you can use Markov Chain theory as an analytical tool to study LLMs: https://arxiv.org/abs/2410.02724
You could probably point your code to Google Books N-grams (https://storage.googleapis.com/books/ngrams/books/datasetsv3...) and get something that sounds (somewhat) reasonable.

by dragonwriter

2 subcomments

I think, strictly speaking, autoregressive LLMs are markov chains of a very high order.
The trick (aside from the order) is the training process by which they are derived from their source data. Simply enumerating the states and transitions in the source data and the probability of each transition from each state in the source doesn’t get you an LLM.

by shenberg

0 subcomment

The short and unsatisfying answer is that an LLM generation is a markov chain, except that instead of counting n-grams in order to generate the posterior distribution, the training process compresses the statistics into the LLM's weights.
There was an interesting paper a while back which investigated using unbounded n-gram models as a complement to LLMs: https://arxiv.org/pdf/2401.17377 (I found the implementation to be clever and I'm somewhat surprised it received so little follow-up work)

by tlarkworthy

1 subcomments

Markov chains learn a fixed distribution, but transformers learn the distribution of distributions and latch onto what the current distribution based on evidence seen so far. So that's where the single shot learning comes from in transformer. Markov chains can't do that, they will not change the underlying distribution as they read.

by thatjoeoverthr

1 subcomments

Others have mentioned the large context window. This matters.
But also important is embeddings.
Tokens in a classic Markov chain are discrete surrogate keys. “Love”, for example, and “love” are two different tokens. As are “rage” and “fury”.
In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.
This does two things for you.
First, it solves the above problem, which is that “different” tokens can be conceptually similar. They’re embedded in a space where they can be compared and contrasted in many dimensions, and it becomes less sensitive to wording.
Second, because the incoming context is now a tensor, it can be used with differentiable model, back propagation and so forth.
I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emmisions. It’s rough but manages multiturn conversation on a consumer GPU.
https://joecooper.me/blog/crosstalk/

by OvrUndrInformed

0 subcomment

Markov Processes are a pretty general concept that can be used to model just about anything if you let the “state” also incorporate some elements of the “history”. I assume that the way transformers are used to model language (next token prediction) can be considered a Markov process where the transition function is modeled by the LLM. Transitions between a state (given by [n] previous tokens, which are the context + text generated so far) and the next state (given by state[n+1] tokens, which are the context + text generated so far + the newest generated token) are given by the probability distribution output by the LLM. Basically I think you can consider auto-regressive LLMs as parameterized Markov processes. Feel free to correct me if I’m wrong.

by unoti

0 subcomment

If you're not sure about what a Markov Chain is, or if you've never written something from scratch that learns, take a look at this repo I made to try to bridge that gap and make it simple and understandable. You can read it in a few minutes. It starts with nothing but Python, and ends with generating text based on the D&D Dungeon Master Manual. https://github.com/unoti/markov-basics/blob/main/markov-basi...

by kleiba

0 subcomment

What's a tiny LLM, given that the first L stands for "large"?

by wodenokoto

0 subcomment

I'd say one of the main differences is that a Markov chain trained over N-grams works on discreet n-grams. Therefore the markov chain cannot tell the difference between two contexts never seen in training. They will both be the "unknown"-token.
An LLM will see a bunch of smaller tokens in a novel order and interpret that.

by canjobear

0 subcomment

Right there in the second sentence. The sentence is ungrammatical because the n-gram model can’t handle long term dependencies. “To obtain such a large medical database of the joints are risk factors” — the verb should be singular and the predicate should be something else, like “to obtain such a large medical database … is difficult”

by currymj

0 subcomment

bigram-trigram language models (with some smoothing tricks to allow for out-of-training-set generalization) were state of the art for many years. Ch. 3 of Jurafsky's textbook (which is modern and goes all the way to LLMs, embeddings etc.) is good on this topic.
https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf
I don't know the history but I would guess there have been times (like the 90s) when the best neural language models were worse than the best trigram language models.

by aespinoza

1 subcomments

Would you be willing to write an article comparing the results ? Or share the code you used to test? I am super interested in the results of this experiment.

by ComplexSystems

0 subcomment

A few things:
First, modern LLMs can be thought, abstractly, as a kind of Markov model. We are taking the entire previous output as one state vector and from there we have a distribution to the next state vector, which is the updated output with another token added. The point is that there is some subtlety in what a "state" is. So that's one thing.
But the point of the usual Markov chain is that we need to figure out the next conditional probability based on the entire previous history. Making a lookup table based on an exponentially increasing history of possible combinations of tokens is impossible, so we make a lookup table on the last N tokens instead - this is an N-gram LLM or an N'th order Markov chain, where states are now individual tokens. It is much easier, but it doesn't give great results.
The main reason here is that sometimes, the last N words (or tokens, whatever) simply do not have sufficient info about what the next word should be. Often times some fragment of context way back at the beginning was much more relevant. You can increase N, but then sometimes there are a bunch of intervening grammatical filler words that are useless, and it also gets exponentially large. So the 5 most important words to look at, given the current word, could be 5 words scattered about the history, rather than the last 5. And this is always evolving and differs for each new word.
Attention solves this problem. Instead of always looking at the last 5 words, or last N words, we have a dynamically varying "score" for how relevant each of the previous words is given the current one we want to predict. This idea is closer to the way humans parse real language. A Markov model can be thought of as a very primitive version of this where we always just attend evenly to the last N tokens and ignore everything else. So you can think of attention as kind of like an infinite-order Markov chain, but with variable weights representing how important past tokens are, and which is always dynamically adjusting as the text stream goes on.
The other difference is that we no longer can have a simple lookup table like we do with n-gram Markov models. Instead, we need to somehow build some complex function that takes in the previous context and computes outputs the correct next-token distribution. We cannot just store the distribution of tokens given every possible combination of previous ones (and with variable weights on top of it!), as there are infinitely many. It's kind of like we need to "compress" the hypothetically exponentially large lookup table into some kind of simple expression that lets us compute what the lookup table would be without having to store every possible output at once.
Both of these things - computing attention scores, and figuring out some formula for the next-token distribution - are currently solved with deep networks just trying to learn from data and perform gradient descent until it magically starts giving good results. But if the network isn't powerful enough, it won't give good results - maybe comparable to a more primitive n-gram model. So that's why you see what you are seeing.

by robot-wrangler

0 subcomment

Previously: https://news.ycombinator.com/item?id=39213410

by ssivark

0 subcomment

The Markov property means that the next token is determined purely by the current token. Well, if it were a Hidden Markov Model, the next state would actually be determined by the current state, and the respective tokens would be a lossy representation of the states.
The problem with HMMs is that the sequence model (Markov transition matrix) accounts for much less context than even Tiny LLMs. One natural way to improve this is to allow the model to have more hidden states, representing more context -- called "clones" because these different hidden states would all be producing the same token while actually carrying different underlying contexts that might be relevant for future tokens. We are thus taking a non-Markov model (like a transformer) and re-framing its representation to be Markov. There have been sequence models with this idea aka Cloned HMMs (CHMMs) [1] or Clone-Structured Cognitive Graphs (CSCGs) [2]. The latter name is inspired by some related work in neuroscience, to which these were applied, which showed how these graphical models map nicely to "cognitive schemas" and are particularly effective in discovering interpretable models of spatial structure.
I did some unpublished work a couple of years ago (while at Google DeepMind) studying how CHMMs scale to simple ~GB sized language data sets like Tiny Stories [3]. As a subjective opinion, while they're not as good as small transformers, they do generate text that is surprisingly good compared with naive expectations of Markov models. The challenge is that learning algorithms that we typically use for HMMs (eg. Expectation Maximization) are somewhat hard to optimize & scale for contemporary AI hardware (GPU/TPU), and a transformer model trained by gradient descent with lots of compute works pretty well, and also scales well to larger datasets and model sizes.
I later switched to working on other things, but I still sometimes wonder whether it might be possible to cook up better learning algorithms attacking the problem of disambiguating contexts during the learning phase. The advantage with an explicit/structured graphical model like a CHMM is that it is very interpretable, and allows for extremely flexible queries at inference time -- unlike transformers (or other sequence models) which are trained as "policies" for generating token streams.
When I say that transformers don't allow flexible querying I'm glossing over in-context learning capabilities, since we still lack a clear/complete understanding and what kinds of pre-training and fine-tuning one needs to elicit them (which are frontier research questions at the moment, and it requires a more nuanced discussion than a quick HN comment).
It turns out, funnily, that these properties of CHMMs actually proved very useful [4] in understanding the conceptual underpinnings of in-context learning behavior using simple Markov sequence models instead of "high-powered" transformers. Some recent work from OpenAI [5] on sparse+interpretable transformer models seems to suggest that in-context learning in transformer LLMs might work analogously, by learning schema circuits. So the fact that we can learn similar schema circuits with CHMMs makes me believe that what we have is a learning challenge and it's not actually a fundamental representational incapacity (as is loosely claimed sometimes). In the spirit of full disclosure, I worked on [4]; if you want a rapid summary of all the ideas in this post, including a quick introduction to CHMMs, I would recommend the following video presentation / slides [6].
[1]: https://arxiv.org/abs/1905.00507
[2]: https://www.nature.com/articles/s41467-021-22559-5
[3]: https://arxiv.org/abs/2305.07759
[4]: https://arxiv.org/abs/2307.01201
[5]: https://openai.com/index/understanding-neural-networks-throu...
[6]: https://slideslive.com/39010747/schemalearning-and-rebinding...

by amelius

0 subcomment

Markov chains are very versatile. When you're holding a hammer ...

by bilsbie

0 subcomment

It wouldn’t be a Markov chain but I’d imagine you could make something functionally close to an LLM with creative uses of word frequency counts and probabilities over the massive amounts of text.

by spencerflem

1 subcomments

Iirc there was some paper that showed that LLMs could be converted to Markov chains and vice versa, but the size of the chain was much much higher

by AndrewKemendo

0 subcomment

Your example is too sparse to make a conclusion from
I’d offer an alternative interpretation: LLMs follow the Markov Decison modeling properties to encode the problem but use a very efficient policy for solver for the specific token based action space.
That is to say they are both within the concept of a “markovian problem” but have wildly different path solvers. MCMC is a solver for an MDP, as is an attention network
So same same, but different

by graycat

0 subcomment

In a Markov stochastic process, the past and the future are conditionally independent given the present.

by petesergeant

0 subcomment

"Attention", as a mechanism, and in practice the ability to produce comprehensible output far outside of the original source text.

by saya0909

1 subcomments

[dead]