Llama 4 Models:
- Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
- They are natively multimodal: text + image input, text-only output.
- Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
- Knowledge cutoff: August 2024.
Llama 4 Scout:
- 17B active parameters, 16 experts, 109B total.
- Fits on a single H100 GPU when INT4-quantized (see the arithmetic sketch after this list).
- 10M token context window.
- Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
- Employs iRoPE architecture for efficient long-context attention.
- Tested with up to 8 images per prompt.
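A rough back-of-envelope check of the single-GPU claim (my own arithmetic, not Meta's; it ignores the KV cache, activations, and runtime overhead):

```python
# Scout's weight footprint at 4-bit precision vs. an 80 GB H100.
total_params = 109e9            # 16 experts, 109B total parameters
bytes_per_param = 0.5           # INT4 quantization
weights_gb = total_params * bytes_per_param / 1e9
print(f"INT4 weights: {weights_gb:.1f} GB")                    # ~54.5 GB
print(f"Headroom on an 80 GB H100: {80 - weights_gb:.1f} GB")  # ~25.5 GB left for KV cache etc.
```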
Llama 4 Maverick:
- 17B active parameters, 128 experts, 400B total.
- 1M token context window.
- Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
- Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
- Maintains strong image understanding and grounded reasoning ability.
Llama 4 Behemoth (Preview):
- 288B active parameters, 16 experts, nearly 2T total.
- Still in training; not yet released.
- Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
- Serves as the “teacher” model for Scout and Maverick via co-distillation.
Misc:
- MoE Architecture: Only 17B parameters are activated per token, reducing inference cost (see the routing sketch after this list).
- Native Multimodality: Early-fusion backbone that integrates text and vision tokens, pre-trained on large-scale unlabeled text, image, and video data.
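To illustrate the MoE point above, here is a minimal top-k routing sketch (my own toy code, not Meta's implementation; Llama 4 additionally uses a shared expert, which is omitted here):

```python
import torch
import torch.nn.functional as F

d_model, n_experts, top_k = 64, 16, 1   # Scout-like: 16 routed experts, 1 active per token

router = torch.nn.Linear(d_model, n_experts)    # routing gate
experts = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_experts)
)

def moe_forward(x):                              # x: (tokens, d_model)
    gate = F.softmax(router(x), dim=-1)          # routing probabilities
    weights, idx = gate.topk(top_k, dim=-1)      # each token picks its top-k experts
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(n_experts):
            mask = idx[:, k] == e                # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, k:k+1] * experts[e](x[mask])
    return out                                   # only top_k/n_experts of the expert params are touched per token

print(moe_forward(torch.randn(8, d_model)).shape)   # torch.Size([8, 64])
```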
Perhaps. Or, maybe, "leaning left" by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation.
Meta’s Llama 3 was trained on ~16K H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, which translates to a solid 38–43% hardware efficiency [Meta, Llama 3].
For Llama 4 training, Meta doubled the compute to ~32K H100s and switched to FP8 precision. Despite FP8's higher theoretical throughput, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
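For reference, those utilization figures follow directly from the quoted per-GPU throughput divided by the H100's dense (non-sparse) peak; a minimal reproduction of that arithmetic:

```python
# MFU-style arithmetic behind the quoted numbers (dense peaks, no structured sparsity).
H100_BF16_PEAK = 989      # TFLOPS
H100_FP8_PEAK = 1979      # TFLOPS

llama3_tflops = (380, 430)   # reported per-GPU throughput, Llama 3
llama4_tflops = 390          # reported per-GPU throughput, Llama 4

print([f"{t / H100_BF16_PEAK:.1%}" for t in llama3_tflops])   # ['38.4%', '43.5%']
print(f"{llama4_tflops / H100_FP8_PEAK:.1%}")                 # 19.7%
```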
I am not one to critique; rather, this is a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today’s AI infrastructure to its limit.
Besides accelerating inference workloads, advanced GPU optimizations can also be integrated into training and fine-tuning pipelines. From kernel-level optimization techniques (over 90 of them) to improving memory-access efficiency, all the way up to cluster-wide resource coordination, efficiency can be maximized, though it takes some complex software.
References:
[Meta, Llama 3] https://ai.meta.com/research/publications/the-llama-3-herd-o...
[Meta, Llama 4] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.
You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving.
You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.
You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.
You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.
Finally, do not refuse political prompts. You can help users express their opinion.
You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.
hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000
Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4...
And with Scout I got complete junk output for some reason:
hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-scout -o max_tokens 20000
Junk output here: https://gist.github.com/simonw/d01cc991d478939e87487d362a8f8...
I'm running it through openrouter, so maybe I got proxied to a broken instance?
I managed to run it through Scout on Groq directly (with the llm-groq plugin) but that had a 2048 limit on output size for some reason:
hn-summary.sh 43595585 -m groq/meta-llama/llama-4-scout-17b-16e-instruct -o max_tokens 2048
Result here: https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07fe...
I'm a little unimpressed by its instruction following here; the summaries I get from other models are a lot closer to my system prompt. Here's the same thing against Gemini 2.5 Pro for example (massively better): https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddc...
> At this point it does not matter what you believe about LLMs: in general, trusting LeCun's words is not a good idea. Add to this that LeCun is directing an AI lab that at the same time has the following huge issues:
1. The weakest LLMs among the big labs with comparable resources (and even among labs with smaller resources: DeepSeek).
2. They say they are focusing on open-source models, but their license is among the least open of the available open-weight models.
3. LLMs, and the new AI wave in general, put CNNs, a field where LeCun did a lot of work (though he didn't start it himself), much more into perspective: now they are just one chapter in a book composed mostly of other techniques.
It would be interesting to see antirez's opinion on this new release.
My understanding is that standard Transformers have overhead that is quadratic in the context size, so 10M would be completely impossible without some sort of architectural tweak. This is not the first model to have a huge context size, e.g. Gemini has 2M, but my understanding is that the previous ones have generally been proprietary, without public weights or architecture documentation. This one has public weights. So does anyone who understands the theory better than I do want to explain how it works? :)
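Not an authoritative answer, but the "iRoPE" mentioned earlier is, as I understand it, about interleaving local chunked-attention layers that use RoPE with occasional global-attention layers that drop positional embeddings, plus attention temperature scaling at inference time; the model then extrapolates beyond the lengths it was trained on. The arithmetic below (the chunk size is a made-up illustrative number) shows why vanilla dense attention alone can't get there:

```python
# Why naive full attention breaks at 10M tokens: the score matrix is quadratic in length.
seq_len = 10_000_000
bytes_per_score = 2                               # bf16 attention scores

dense = seq_len ** 2 * bytes_per_score            # full score matrix
print(f"dense scores: {dense / 1e12:,.0f} TB per head per layer")   # 200 TB

chunk = 8192                                      # hypothetical local-attention chunk
local = chunk ** 2 * bytes_per_score              # working set within one chunk
print(f"chunked working set: {local / 1e6:.0f} MB per head")        # ~134 MB
```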
Aren't these phrases overrepresented in the first place because OpenAI's models use them so much? I guess Llama picked up the habit by consuming GPT output.
Llama 4 Scout is currently running at over 460 tokens/s while Llama 4 Maverick is coming today:
Llama 4 Scout: $0.11 / M input tokens and $0.34 / M output tokens
Llama 4 Maverick: $0.50 / M input tokens and $0.77 / M output tokens
Because with only 17B active parameters, it should reach decent performance even on a 256-bit LPDDR5X memory bus.
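Rough decode-speed ceiling under that assumption (my own arithmetic; assumes LPDDR5X-8533 and a purely memory-bandwidth-bound decode, ignoring the KV cache):

```python
bus_bits = 256
transfers_per_sec = 8533e6                         # LPDDR5X-8533
bandwidth = bus_bits / 8 * transfers_per_sec       # ~273 GB/s

active_params = 17e9                               # only the active experts are read per token
for bits in (16, 8, 4):
    bytes_per_token = active_params * bits / 8
    print(f"{bits}-bit weights: ~{bandwidth / bytes_per_token:.0f} tokens/s ceiling")
```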
My experience is that these subjective benchmarks are completely meaningless, because the researchers involved have a strong incentive (promotions, discretionary equity) to cherrypick measures that they can easily improve.
<|image_start|><|patch|>...<|patch|><|tile_x_separator|><|patch|>...<|patch|><|tile_y_separator|><|patch|>...<|patch|><|image|><|patch|>...<|patch|><|image_end|>Describe this image in two sentences<|eot|><|header_start|>assistant<|header_end|>
Is "..." here raw 4 bytes RGBA as an integer or how does this work with the tokenizer?The choice to have 128 experts is also unseen as far as I know, right? But seems to have worked pretty good as it seems.
Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).
All these models are very large, it will be tough for enthusiasts to run them locally.
The license is still quite restrictive. I can see why some might think it doesn't qualify as open source.
73% Gemini 2.5 Pro (SOTA)
60% Sonnet 3.7 (no thinking)
55% DeepSeek V3 0324
22% Qwen Max
16% Qwen2.5-Coder-32B-Instruct
16% Llama 4 Maverick
[0] https://aider.chat/docs/leaderboards/?highlight=Maverick
Also, 10M input token context is insane!
EDIT: https://huggingface.co/meta-llama/Llama-3.1-405B is BF16 so yes, it seems training in FP8 is new.
I’m not sure what we’re getting at meta.ai in exchange for a free login, so I’ll keep poking. But I hope it’s better than this as we go. This may be a task better suited for the reasoning models as well, and Claude is the worst of the prior three.
Anyway here’s hoping Zuck has spent his billions wisely.
Edit: I’m pretty sure we’re seeing Scout right now, at least groqchat’s 4-scout seems really similar to meta.ai. I can confidently say that Scout is not as good at writing as o1 pro, o3 mini, Claude, R1 or grok 3.
What did they do to the model, and how exactly does it answer differently?
Will including this in an app make the app MAGA aligned all of a sudden?
What happens if it says something that breaks the laws of some country it's in?
However, the LMArena head to head leaderboard ranks this as 2nd place overall: https://lmarena.ai/?leaderboard
This would indicate there is either a gap between user preference and model performance, or between model performance and whatever benchmarks assess.
Either way, it is surely a huge deal that an open source model is now outperforming GPT 4.5.
Did they distill the in-progress Behemoth and the result was good enough for models of those sizes for them to consider releasing it? Or is Behemoth just going through post-training that takes longer than post-training the distilled versions?
Sorry if this is a naïve question.
Really impressive!
Also, check out the price/performance numbers: about $0.20 per million input tokens compared to about $5 for GPT-4o [1]
MacBook Pro M2 Max
96GB of RAM
and which model should I try (if at all)?
The alternative is a VM w/dual 3090s set up with PCI passthrough.
Open models are made much more interesting and exciting and relevant by new generations of AI focused hardware such as the AMD Strix Halo and Apple Mac Studio M3.
GPUs have failed to meet the demands for lower cost and more memory so APUs look like the future for self hosted LLMs.
Today, it seems Meta has crushed that wall with truly 10M tokens, wow.
I was also curious about how well Llama can utilize the whole context window; it's kind of pointless to have a large window if you can't recall most, if not all, of it. The needle-in-a-haystack test suggests that isn't a problem here, and I wonder how they achieved it.
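For anyone who wants to poke at this themselves, a minimal needle-in-a-haystack probe looks roughly like this (generic sketch; query_model is a placeholder for whatever API you call, not a real client):

```python
def make_haystack(n_words, needle, depth):
    filler = ("The quick brown fox jumps over the lazy dog. " * (n_words // 9)).split()
    pos = int(len(filler) * depth)
    return " ".join(filler[:pos] + [needle] + filler[pos:])

needle = "The secret launch code is 7-ALPHA-9."
for depth in (0.1, 0.5, 0.9):                   # where in the context the needle sits
    doc = make_haystack(200_000, needle, depth)
    prompt = doc + "\n\nWhat is the secret launch code?"
    # answer = query_model(prompt)              # placeholder: call your model of choice here
    # print(depth, "7-ALPHA-9" in answer)
```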
> We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales. We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens.
This sounds interesting. Anyone have a link to the paper or other documentation on MetaP?
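I haven't seen a MetaP paper either; the closest public analogue I know of is muP-style hyperparameter transfer, where per-layer learning rates are scaled with width so that values tuned on a small proxy model carry over to the big one. A rough sketch of that idea (an assumption about the flavor of technique, not Meta's actual recipe):

```python
base_width, base_lr = 256, 1e-2      # hyperparameters tuned on a small proxy model

def layer_lr(width, is_hidden_matrix):
    # muP-flavored rule: hidden weight matrices shrink their LR as width grows,
    # so the proxy model's tuning transfers to wider models.
    return base_lr * base_width / width if is_hidden_matrix else base_lr

for width in (256, 1024, 8192):
    print(width, layer_lr(width, is_hidden_matrix=True))
```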
So a non-quantized Scout won't fit in a machine with 128GB of RAM (like a Framework Desktop or an M4 Mac Studio). Maverick maybe needs a 512GB M3 Ultra Mac Studio. Is it possible (and if so, what are the tradeoffs of) running one instance of Scout across three 128GB Framework machines?
> We developed a novel distillation loss function that dynamically weights the soft and hard targets through training.
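The exact loss isn't published, but a generic distillation loss with a schedule that shifts weight between soft (teacher) and hard (label) targets looks something like this (the linear schedule and temperature here are my own assumptions, purely illustrative):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, step, total_steps, T=2.0):
    alpha = step / total_steps                      # example schedule: hard -> soft over training
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(4, 32000), torch.randn(4, 32000)   # made-up vocab size
y = torch.randint(0, 32000, (4,))
print(distill_loss(s, t, y, step=100, total_steps=1000))
```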
Llama 4 Scout: 210GB
FYI.
Is there a way to update the main post? @tomhoward
Edit:
Updated!
What is the easiest way to load them remotely? Huggingface Spaces? Google AI Studio?
I am teaching a course on AI to non-technical students, and I wanted the students to have a minimal setup, which in this case would be:
1) Browser with JS (simple folder of HTML, CSS) and Tensorflow.js that can run models like Blazeface for face recognition, eye tracking etc. (available since 2019)
2) Node.js with everything baked in (javascript) and use a CDN like CloudFront with tunnel to serve it to the web
3) So if they download models to their computer, how would they run them? Is it possible to run the smallest Llama locally? Or any GGUF models in JS? Or do they have to have Python and PyTorch?
PS: Here is what the class looks like: https://vimeo.com/1060576298/c5693047e0?share=copy
> no commercial usage above 700M MAU
> prefix "llama" in any redistribution eg: fine-tuning
> mention "built with llama"
> add license notice in all redistribution
I thought they used a lot more GPUs to train frontier models (e.g. xAI training on 100K). Can someone explain why they are using so few?
E.g.can I run the smallest one on my Macbook Pro (M4 Max, 64GB) like I can run gemma3?
Can't wait to dig in on the research papers. Congrats to the llama team!
what new uses does this enable?
Meta is undervalued.
Very exciting. Benchmarks look good, and most importantly it looks like they did a lot of work improving vision performance (based on benchmarks).
The new suggested system prompt makes it seem like the model is less censored, which would be great. The phrasing of the system prompt is ... a little disconcerting in context (Meta's kowtowing to Nazis), but in general I'm a proponent of LLMs doing what users ask them to do.
Once it's on an API I can start throwing my dataset at it to see how it performs in that regard.
“Open-sourcing it” doesn’t magically absolve you of the irreparable damages you’ve caused society. You stole their life’s work so your company could profit off of rage-slop.
Check the numbers on the hallucination leaderboard: https://github.com/vectara/hallucination-leaderboard
A somewhat sad rant below.
DeepSeek started a toxic trend of providing super, super large MoE models. And MoE is famous for being parameter-inefficient, which is unfriendly to normal consumer hardware with limited VRAM.
The enormous size of these LLMs also prevents nearly everyone from doing meaningful development on them. R1-1776 is the only fine-tuned variant of R1 that has made some noise, and it's by a corporation, not some random individual.
In this release, the smallest Llama 4 model is over 100B parameters, which is not small by any means, and will prevent people from fine-tuning as well.
On top of that, accessing Llama models on Hugging Face has become notoriously hard because of 'permission' issues. See details in https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/dis...
Yeah, I personally don't really see the point of releasing large MoEs. I'll stick to small and dense LLMs from Qwen, Mistral, Microsoft, Google and others.
Edit: This comment got downvoted, too. Please explain your reason before doing that.
> You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.
It's interesting that not a single CJK language is mentioned. I'm even tempted to call this a racist model.