This all turned out to be mostly irrelevant in my subsequent programming career.
Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… cross-entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
It’s fun using this stuff again. All of it except the tensor calculus on curved spacetime; I haven’t had to reach for that yet.
Thanks Andrej for the time and effort you put into your videos.
[0] https://www.manning.com/books/build-a-large-language-model-f...
Completely pointless to anyone who is not writing the lowest level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.
This is as if you started explaining how an ICE car works by diving into the chemical properties of petrol. Yeah, that really is the basis of it all, but no, it is not where you start when explaining how a car works.
At some point I tried to create a step-by-step introduction, where people can interact with these concepts and see how to express them in PyTorch:
https://github.com/stared/thinking-in-tensors-writing-in-pyt...
[0] https://www.coursera.org/specializations/mathematics-for-mac... [1] https://www.manning.com/books/math-and-architectures-of-deep...
You compute an embedding vector for your documents or chunks of documents. Then you compute the vector for your user's prompt and use the cosine distance to find the most semantically relevant documents. There are other tricks like reranking the documents once you find the top N documents related to the query, but that's basically it.
Here’s a good explanation
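And as a rough sketch of that retrieval step in Python (the embed() calls are placeholders for whatever embedding model or API you actually use):

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity: dot product of the two vectors, normalized by their lengths.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def top_k_documents(query_vec, doc_vecs, k=3):
        # Rank document chunks by similarity to the query embedding, return the top k indices.
        scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
        return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

    # doc_vecs = [embed(chunk) for chunk in chunks]   # precomputed once, stored in a vector DB
    # query_vec = embed(user_prompt)                  # computed per prompt
    # best = top_k_documents(query_vec, doc_vecs)     # optionally rerank these before use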
While reading through past posts I stumbled on a multi part "Writing an LLM from scratch" series that was an enjoyable read. I hope they keep up writing more fun content.
I'm a happy customer and have found it to be one of the best paid resources for learning mathematics in general. I wish I had this when I was a student.
I think it went pretty well (was able to understand most of the logic and maths), and I touched on some of these terms.
There is certainly some hype; a lot of what is on the market is just not viable.
The only thing is that nobody understands why they work so well. There are a few function approximation theorems that apply, but nobody really knows how to make them behave as we would like.
So basically AI research is 5% "maths", 20% data sourcing and engineering, 50% compute power, and 25% trial and error
Does it? I don't think so. All the math involved is pretty straightforward.
It was a really exciting time for me as I had pushed the team to begin looking at vectors beyond language (actions and other predictable parameters we could extract from linguistic vectors).
We had originally invented a lot of this because we were trying to make chat and email easier and faster, and ultimately I had morphed it into predicting UI decisions based on conversation vectors. Back then we could only do pretty simple predictions (continue the vector strictly, reverse it strictly, or N vector options on an axis), but we shipped it and you saw it when we made Hangouts, Gmail and Allo predict your next sentence. Our first incarnation was interesting enough that Eric Schmidt recognized it and took my work to the board as part of his big investment in ML. From there the work in Hangouts became Allo/Gmail etc.
Bizarrely enough, though, under Sundar this became the Google Assistant, but we couldn't get much further without attention layers, so the entire project regressed back to fixed bot pathing.
I argued pretty hard with the executives that this was a tragedy, but Sundar would hear none of it, completely obsessed with Alexa and having a competitor there.
I found some sympathy with the now-head of Search, who gave me some budget to invest in a messaging program that would advance prediction to get to full action prediction across the search surface and UI. We launched and made it a business messaging product, but lost the support of executives during the LLM panic.
Sundar cut us and fired the whole team, ironically right when he needed it the most. But he never listened to anyone who worked on the tech and seemed to hold their thoughts in great disdain.
What happened after that is of course well known now, as Sundar ignored some of the most important tech in history due to this attitude.
I don’t think I’ll ever fully understand it.
Graphs - It all starts with computational graphs. These are data structures that chain together operations: usually matrix multiplications, additions, activation functions and a loss function. The computations are differentiable, resulting in a smooth, continuous space appropriate for continuous optimization (gradient descent), which is covered later.
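A minimal PyTorch sketch of such a graph (the shapes are arbitrary), where every operation is differentiable and gradients can flow back through it:

    import torch

    # A tiny computational graph: y = relu(W x + b), followed by a stand-in scalar loss.
    x = torch.randn(4)
    W = torch.randn(3, 4, requires_grad=True)
    b = torch.zeros(3, requires_grad=True)

    y = torch.relu(W @ x + b)   # matrix multiply, add, activation
    loss = (y ** 2).mean()      # stand-in loss function

    loss.backward()             # walk the graph backwards, filling W.grad and b.grad
    print(W.grad.shape, b.grad.shape)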
Layers - Layers are modules built from these graphs that apply some computation and store the result as state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, ultimately learning a latent manifold: a highly structured, lower-dimensional space that interpolates between samples, which is what makes generalization to new predictions possible.
Different machine learning problems and data types use different layers, e.g. Transformers for sequence-to-sequence learning, convolutions for computer vision models, etc.
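For example, a small stack of layers in PyTorch might look like this (the sizes are arbitrary, chosen just to show progressively smaller, more abstract representations):

    import torch.nn as nn

    # Stacking layers: each layer maps its input to a new representation.
    model = nn.Sequential(
        nn.Linear(784, 256),   # raw input features -> intermediate representation
        nn.ReLU(),
        nn.Linear(256, 64),    # deeper, more abstract representation
        nn.ReLU(),
        nn.Linear(64, 10),     # final representation: e.g. class scores
    )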
Models - Models organize stacks of layers for training. They include a loss function that sends a feedback signal to an optimizer, which adjusts the learned weights during training. Models also include an evaluation metric, such as accuracy, that is independent of the loss function.
Forward pass - For training or inference, an input passes through all the network layers and a geometric transformation is applied, producing an output.
Backpropagation - During training, after the forward pass, gradients are calculated for each weight with respect to the loss (gradients are just another word for derivatives). The process for calculating the derivatives is called automatic differentiation, which is based on the chain rule.
Once the derivatives are calculated, the optimizer updates the weights so as to reduce the loss. This is the process called "learning", often referred to as gradient descent.
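A toy training step in PyTorch tying those last few pieces together: model, loss, forward pass, backpropagation and the optimizer update (plain SGD here for simplicity; the data and shapes are made up):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    inputs = torch.randn(32, 10)            # a batch of toy inputs
    targets = torch.randint(0, 2, (32,))    # toy labels

    logits = model(inputs)                  # forward pass
    loss = loss_fn(logits, targets)         # feedback signal

    optimizer.zero_grad()
    loss.backward()                         # backpropagation: compute gradients
    optimizer.step()                        # "learning": nudge weights along the gradients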
Now for Large Language Models.
Before models are trained for sequence-to-sequence learning, the corpus of text must be tokenized and transformed into embeddings.
Embeddings are dense representations of language: vectors in a multidimensional space that can capture meaning and context for the different combinations of words that appear in sequences.
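In PyTorch an embedding is just a learned lookup table from token ids to dense vectors (the vocabulary size, dimension and token ids below are purely illustrative):

    import torch
    import torch.nn as nn

    # Each token id maps to a dense vector whose coordinates are learned during training.
    vocab_size, dim = 50_000, 512
    embedding = nn.Embedding(vocab_size, dim)

    token_ids = torch.tensor([15, 2047, 873])   # a hypothetical tokenized sequence
    vectors = embedding(token_ids)              # shape: (3, 512)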
LLMs use a specific kind of network layer, the Transformer block, which includes something called an attention mechanism.
The attention mechanism uses the embeddings to dynamically update the meaning of words when they are brought together in a sequence.
The model uses three different representations of the input sequence, called the key, query and value matrices.
Using dot products, attention scores are computed to weigh how relevant each part of the reference sequence is to each other part, and from these a target sequence is generated.
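A bare-bones sketch of that scaled dot-product attention step (a single head, with no masking and no projection layers shown):

    import torch
    import torch.nn.functional as F

    def attention(Q, K, V):
        # Scores say how strongly each position should attend to every other position;
        # softmax turns them into weights, which are used to mix the value vectors.
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
        weights = F.softmax(scores, dim=-1)
        return weights @ V

    # Q, K and V are linear projections of the same embedded sequence,
    # e.g. each of shape (seq_len, d_k).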
The output sequence is predicted one word at a time: at each step, a softmax function turns the model's output into a probability distribution over the vocabulary, and the next word is sampled from it.
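A sketch of that final sampling step, with random logits standing in for the model's real output at one position:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(50_000)                 # stand-in for the model's output logits
    probs = F.softmax(logits, dim=-1)            # probability distribution over the vocabulary
    next_token = torch.multinomial(probs, num_samples=1)   # sample the next token id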