| Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 |
|-----------------------|-----------|---------|------------|-----------|
| Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% |
| ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% |
| GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% |
| AIME 2025 | | | | |
| (no tools) | 95.0% | 88.0% | 87.0% | 94.0% |
| (code execution) | 100% | - | 100% | - |
| MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% |
| MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% |
| ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% |
| CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% |
| OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 |
| Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% |
| LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 |
| Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% |
| SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% |
| t2-bench | 85.4% | 54.9% | 84.7% | 80.2% |
| Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43 |
| FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% |
| SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% |
| MMLU | 91.8% | 89.5% | 89.1% | 91.0% |
| Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% |
| MRCR v2 (8-needle) | | | | |
| (128k avg) | 77.0% | 58.0% | 47.1% | 61.6% |
| (1M pointwise) | 26.3% | 16.4% | n/s | n/s |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly
For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.
The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.
If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.
I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
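Even something crude would be informative. Here's a minimal sketch of what I mean in Python (the `call_model` function is a hypothetical stand-in for whatever API client you use; the instructions are kept mechanically checkable on purpose):

```python
# Rough sketch of an instruction-count benchmark: give the model N mechanically
# checkable instructions in one prompt and count how many it actually follows.
# `call_model` is a placeholder for whatever client you use (hypothetical);
# the instructions are deliberately trivial so compliance is just a string check.
from typing import Callable

def make_instructions(n: int) -> list[tuple[str, Callable[[str], bool]]]:
    """Return (instruction_text, checker) pairs that are trivially verifiable."""
    pairs = []
    for i in range(n):
        word = f"token{i}"
        pairs.append((
            f"Include the exact word '{word}' somewhere in your reply.",
            lambda out, w=word: w in out,  # bind w now to avoid late-binding bug
        ))
    return pairs

def instruction_following_score(call_model: Callable[[str], str], n: int) -> float:
    pairs = make_instructions(n)
    prompt = "Follow ALL of these instructions in a single reply:\n" + "\n".join(
        f"{i + 1}. {text}" for i, (text, _) in enumerate(pairs)
    )
    output = call_model(prompt)
    followed = sum(check(output) for _, check in pairs)
    return followed / n  # fraction of instructions actually honored

# Sweep N and watch where each model's compliance starts to fall off:
# for n in (20, 50, 100, 250, 500, 1000):
#     print(n, instruction_following_score(my_model, n))
```

Real instructions would obviously be more varied than "include this word", but even a toy like this would let you plot compliance vs. instruction count per model.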
Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.
Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.
Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)
[1] https://blog.google/technology/ai/introducing-pathways-next-...
I wonder how significant this is. DeepMind was always more research-oriented than OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).
This model is not a modification or a fine-tune of a prior model
Is it common to mention that? Feels like they built something from scratch.

"Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities"
<=200K tokens • Input: $2.00 / Output: $12.00
> 200K tokens • Input: $4.00 / Output: $18.00
Knowledge cutoff: Jan. 2025
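Assuming those rates are per 1M tokens (my assumption here, based on how Gemini pricing is usually quoted), the tiering works out roughly like this:

```python
# Back-of-the-envelope cost for the tiered pricing quoted above, assuming the
# rates are USD per 1M tokens and the tier is chosen by prompt size.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # <=200K-token tier
    else:
        in_rate, out_rate = 4.00, 18.00   # >200K-token tier
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. a 150K-token prompt with an 8K-token reply:
# request_cost(150_000, 8_000) -> 0.30 + 0.096 = $0.396
```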
I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.
Gemini 3 Pro gets 31.1% on ARC-AGI-2
https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
And I really don't think I'm alone in this.
NVDA is down 3.26%
I think specialized hardware for training models is the next big wave in China.
For comparison:
Gemini 2.5 Pro was $1.25/M for input and $10/M for output
Gemini 1.5 Pro was $1.25/M for input and $5/M for output
Who is training LLMs with CPUs?
Still taking nearly a year to train and run post-training safety and stability tuning.
With 10x the infrastructure they could iterate much faster. I don't see AI infrastructure as a bubble; it is still a bottleneck on the pace of innovation at today's level of active deployment.
wayback machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
Well don't complain when you are using Gmail and your emails are being used to train Gemini.
https://www.google.com/search?q=gemini+u.s.+senator+rape+all...