by llamasushi
19 subcomments
- The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.
Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.
The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.
by unsupp0rted
15 subcomments
- This is gonna be game-changing for the next 2-4 weeks before they nerf the model.
Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.
Then a sacrificial Anthropic engineer will “discover” a couple of obscure bugs that “in some cases” might have led to less-than-optimal performance. Still largely a user skill issue though.
Then a couple months later they’ll release Opus 4.7 and go through the cycle again.
My allegiance to these companies is now measured in nerf cycles.
I’m a nerf cycle customer.
- I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.
I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.
by dave1010uk
1 subcomment
- The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150-page PDF with all sorts of info, not just the usual benchmarks.
There's a big section on deception. In one example, Opus is fed news about Anthropic's safety team being disbanded but then hides that info from the user.
The risks are a bit scary, especially around CBRNs. Opus is still only ASL-3 (systems that substantially increase the risk of catastrophic misuse) and not quite at ASL-4 (uplifting a second-tier state-level bioweapons programme to the sophistication and success of a first-tier one), so I think we're fine...
I've never written a blog post about a model release before but decided to this time [1]. The system card has quite a few surprises, so I've highlighted some bits that stood out to me (and Claude, ChatGPT and Gemini).
[0] https://www.anthropic.com/claude-opus-4-5-system-card
[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...
- Seeing these benchmarks makes me so happy.
Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.
This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.
I've been holding onto Claude Code for the last little while since I've built up a robust set of habits, slash commands, and sub-agents that help me squeeze as much out of the platform as possible.
But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.
Thankfully Anthropic has come out swinging today and my own SOPs can remain intact a little while longer.
by futureshock
1 subcomment
- A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare to Gemini 3’s score and cost. Gemini 3 Deep Think is still the current leader, but at more than 30x the cost.
The cost curve of achieving these scores is coming down rapidly. In Dec 2024 when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. You can get the same performance for pennies to dollars, approximately an 80x reduction in 11 months.
https://arcprize.org/leaderboard
https://arcprize.org/blog/oai-o3-pub-breakthrough
- Notes and two pelicans: https://simonwillison.net/2025/Nov/24/claude-opus/
- Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.
On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.
EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus with pay-as-you-go. This is going to be really nice to use Opus for planning and Sonnet for implementation with the Pro subscription.
However, I noticed that the previously-there option of "use Opus for planning and Sonnet for implementation" isn't there in Claude Code with this setup any more. Hopefully they'll implement it soon, as that would be the best of both worlds.
EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription, I'll have to try it and see.
EDIT 3: It looks like Sonnet also consumes credits in this mode. I had it make some simple CSS changes to a single HTML file with Opusplan, and it cost me $0.95 (way too much, in my opinion). I'll try manually switching between Opus for the plan and regular Sonnet for the next test.
- All the users in the comments here complaining about API limits and usage limits have missed the boat. You're not the target audience. This AI is not for you. It's not for consumers and end users. This AI is for the multi-billion and trillion-dollar businesses who are signing massive contracts to get these models enabled for their entire company. I've been using Sonnet 4.5 for months and never had a usage limit ever. And I used every model before that, all day and all night, and never once saw any mention of usage limits. Never saw a bill either. If "price per token" is a concern to you then you already lost.
by jasonthorsness
2 subcomments
- I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.
[1] https://github.com/jasonthorsness/tree-dangler
by hebejebelus
2 subcomments
- On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.
A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.
- We've added support for Opus 4.5 to v0 and users are making some pretty impressive 1-shots:
https://x.com/mikegonz/status/1993045002306699704
https://x.com/MirAI_Newz/status/1993047036766396852
https://x.com/rauchg/status/1993054732781490412
It seems especially good at threejs / 3D websites. Gemini was similarly good at them (https://x.com/aymericrabot/status/1991613284106269192); maybe the model labs are focusing on this style of generation more now.
- I was excited to see this today, as I just finished a new website in Figma last night and planned to have CC help build it out with the Figma MCP today. CC started up with Opus 4.5 and... Not off to a good start. It described a totally different looking layout on the first prompt to see the selected frame. I asked why:
You're right to call that out. Looking back at what happened:
1. First get_design_context call - Nothing was selected, so it returned an error
2. Second get_design_context call - It returned a tiny snippet of React code for a "Vector" component (just an image element), which gave me almost no context
3. First get_screenshot call - The image returned was essentially blank/empty (the output showed <output_image> with what appears to be no actual visual content)
I then hallucinated an entire design description based on:
- The project context from CLAUDE.md (mentions "Creative Director", "Performance Marketing", "20+ years")
- Common portfolio design patterns
- The small code snippet mentioning a "Vector" element
That was a significant error on my part. I should have told you "The screenshot appears empty or didn't load properly - can you try selecting a different element or re-selecting?" instead of fabricating a description.
by jumploops
3 subcomments
- > Pricing is now $5/$25 per million [input/output] tokens
So it’s 1/3 the price of Opus 4.1…
> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens
…and potentially uses a lot fewer tokens?
Excited to stress test this in Claude Code, looks like a great model on paper!
- Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.
And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.
Because in my experience, Haiku sits at this weird middle point where, if you have a well defined task, you can use a smaller/faster/cheaper model than Haiku, and if you don't, then you need to reach for a bigger/slower/costlier model than Haiku.
- 80% on SWE-bench Verified is incredible. A year ago the best model was at ~30%. I wonder if we'll soon have a convincingly superhuman coding capability (even in a narrow field like kernel optimization).
This is the most interesting time for software tools since compilers and static typechecking were invented.
- Opus 4.5's scaling is impressive on benchmarks, but the usual caveats apply: benchmark saturation is real, and we're seeing diminishing returns on evals that test pattern-matching vs. genuine reasoning. The more relevant question: has anyone stress-tested this on novel problems or complex multi-step reasoning outside training data distributions? Marketing often showcases 'advanced math' and 'code generation' where the solutions exist in training data. The claim of 'reasoning improvement' needs validation on genuinely unfamiliar problem classes.
- The rate of improvement in LLMs has really slowed down. This looks like a minor improvement in accuracy and big gains in efficiency.
- Interesting that the number of HN comments on big model announcements seems to be dropping. I recall previous ones easily surpassing 1k.
Maybe models are starting to get good enough / levelling off?
- So far this seems like a huge downgrade from Opus 4.1. Please add back 4.1 as an option...
- Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.
I’ve always found Opus significantly better than the benchmarks suggested.
LFG
- As much as I am excited by the price, the features they call "advanced tool use" [1] look so useful to me: tool search, programmatic tool calling (like smolagents.CodeAgent from HF), and tool use examples (in-context learning).
They said that they have seen 134K tokens for tool definitions alone. That is insane. I also really liked the puzzle game video.
[1] https://www.anthropic.com/engineering/advanced-tool-use
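For anyone who hasn't used tool calling via the API, here's a minimal sketch of why definitions eat tokens (assuming the standard Anthropic Python SDK; the get_weather tool and the exact model-id string are illustrative, not from the post): every tool's name, description, and JSON schema is sent along with each request, so an agent exposing dozens of tools pays that overhead on every call, which is how you reach six-figure token counts before the user has said a word.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # One tool definition (hypothetical example). Each tool's name, description,
    # and JSON schema is serialized into the prompt; multiply by dozens of tools
    # and the overhead adds up quickly.
    weather_tool = {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            },
            "required": ["city"],
        },
    }

    response = client.messages.create(
        model="claude-opus-4-5",  # assumed model id; check the docs for the exact string
        max_tokens=1024,
        tools=[weather_tool],
        messages=[{"role": "user", "content": "Is it raining in Berlin right now?"}],
    )

    # If the model decides to call the tool, the response contains a tool_use
    # block with the tool name and the arguments it chose.
    for block in response.content:
        if block.type == "tool_use":
            print(block.name, block.input)

As I read the engineering post, tool search attacks exactly this: definitions get looked up on demand instead of being front-loaded into the context.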
by nickandbro
1 subcomment
- "Create me a SVG of a PS4 controller"
Gemini 3.0 Pro:
https://www.svgviewer.dev/s/CxLSTx2X
Opus 4.5:
https://www.svgviewer.dev/s/dOSPSHC5
I think Opus 4.5 did a bit better overall, but I do think frontier models will eventually converge to a point where the quality is so good it will be hard to tell the winner.
by irthomasthomas
0 subcomments
- I wish it was open-weights so we could discuss the architectural changes. This model is about twice as fast as 4.1, ~60 t/s vs ~30 t/s. Is it half the parameters, or a new INT4 linear sparse-MoE architecture?
- Great seeing the price reduction. Opus was historically priced at $15/$75; this one delivers at $5/$25, which is close to Gemini 3 Pro. I hope Anthropic can afford to increase limits for the new Opus.
by johnnycombin
0 subcomments
- The most overhyped model ever; not even close to Gemini 3 or GPT-5.1 after 8h of complex tasks.
- Can't wait to try Opus 4.5
We just evaluated it for Vectara's grounded hallucination leaderboard: it scores at 10.9% hallucination rate, better than Gemini-3, GPT-5.1-high or Grok-4.
https://github.com/vectara/hallucination-leaderboard
by chaosprint
1 subcomment
- The SWE-bench results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but for Anthropic, I expect more.
by nickandbro
0 subcomments
- I use the following models like so nowadays:
Gemini is great when you have gitingested the code of a PyPI package and want to use it as context. This comes in handy for tasks and repos outside the model's training data.
5.1 Codex I use for a narrowly defined task where I can just fire and forget it. For example, codex will troubleshoot why a websocket is not working, by running its own curl requests within cursor or exec'ing into the docker container to debug at a level that would take me much longer.
Claude 4.5 Opus is a model that I feel is trustworthy for heavy refactors of code bases or for modularizing sections of code to become more manageable. Often it seems like the model doesn't leave any details out, and the functionality is not lost or degraded.
by jaakkonen
1 subcomment
- Tested this today for implementing a new low-frequency RFID protocol in the Flipper Zero codebase based on a Proxmark3 implementation. Was able to do it in 2 hours, giving it a raw PSK recording alongside and some troubleshooting. This is the kind of task the last generation of frontier models was incapable of doing. Super stoked to use this :)
by andreybaskov
2 subcomments
- Does anyone know, or have a guess about, the size of these latest thinking models and what hardware they use to run inference? As in how much memory and what quantization they use, and whether it's "theoretically" possible to run one on something like a Mac Studio M3 Ultra with 512GB RAM. Just curious from a theoretical perspective.
- Does anyone here understand "interleaved scratchpads" mentioned at the very bottom of the footnotes:
> All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), and default sampling settings (temperature, top_p).
I understand scratchpads (e.g. [0] Show Your Work: Scratchpads for Intermediate Computation with Language Models) but not sure about the "interleaved" part, a quick Kagi search did not lead to anything relevant other than Claude itself :)
[0] https://arxiv.org/abs/2112.00114
- After experimenting with Gemini 3, I still felt like Sonnet 4.5 had the edge. So I'm very excited to start playing with this in the wild.
by AbstractH24
0 subcomments
- Amazing how every company's newest model performs best in the benchmarks they share in the announcement....
- “For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet.” — seems like anthropic has finally listened!
- Gemini 3 in Antigravity is so much better than Claude Code with either Opus or Sonnet that I struggle to see how they can compete. And I'm someone with the $100/month plan.
I can't even use Opus for a day before it runs out. This will make it better, but Antigravity has a way better UI and better bug solving.
by morgengold
0 subcomments
- I'm on a Claude Code Max subscription. The last few days have been a struggle with Sonnet 4.5. Now it switched to Claude Opus 4.5 as the default model. Ridiculously good and fast.
by starkparker
0 subcomments
- Would love to know what's going on with C++ and PHP benchmarks. No meaningful gain over Opus 4.1 for either, and Sonnet still seems to outperform Opus on PHP.
- The real question I have after seeing the usage rug being pulled is what this costs and how usable this ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable by anyone paying enterprise prices. And the modification of "usage" quotas has made the platform fundamentally unstable, and honestly, it left me personally feeling like I was cheated by Anthropic...
- With less token usage, cheaper pricing, and enhanced usage limits for Opus, Anthropic are taking the fight to Gemini and OpenAI Codex. Coding agent performance leads to better general work and personal task performance, so if Anthropic continue to execute well on ergonomics they have a chance to overcome their distribution disadvantages versus the other top players.
- I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.
- What causes the improvements in new AI models recently? Is it just more training, or is it new, innovative techniques?
- I wish the article's graphs weren't distorted by skipping so much of the scale to make it look like a more significant difference than it is. But it does look impressive.
- Anecdotally, I’ve been using opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them and build a report. It’s very very good at this, and much better than opus 4.1. I actually didn’t realise that I was using opus 4.5 until I saw this thread.
- One thing I didn't see mentioned is raw token gen speed compared to the alternatives. I am using Haiku 4.5 because it is cheap (and so am I) but also because it is fast. Speed is pretty high up in my list of coding assistant features and I wish it was more prominent in release info.
- Tested this by building some PRs and issues that codex-5.1-max and gemini-3-pro were struggling with.
It planned way better, in a much more granular way, and then executed better. I can't tell if the model is actually better or if it's just planning with more discipline.
- Has there been any announcement of a new programming benchmark? SWE looks like it's close to saturation already. At this point for SWE it may be more interesting to start looking at which types of issues consistently fail/work between model families.
by agentifysh
1 subcomment
- Again, the question of concern as a Codex user is usage.
It's hard to get any meaningful use out of Claude Pro;
after you ship a few features you are pretty much out of weekly usage,
compared to what codex-5.1-max offers on a plan that is 5x cheaper.
The 4-5% improvement is welcome, but honestly I question whether it's possible to get meaningful usage out of it the way Codex allows.
For most use cases medium or 4.5 handles things well, but Anthropic seems to have much lower usage limits than what OpenAI is subsidizing.
Until they can match what I can get out of Codex, it won't be enough to win me back.
Edit: I upgraded to Claude Max! I read the blog carefully and it seems like usage is lifted for Opus 4.5 as well as Sonnet 4.5!
by mutewinter
0 subcomments
- Some early visual evaluations: https://x.com/mutewinter/status/1993037630209192276
by rutagandasalim
1 subcomment
- Claude Opus 4.5 is an incredible model.
I just one-shotted https://aithings.dev with it.
- Does it follow directions? I’ve found Sonnet 4.5 to be useless for automated workflows because it refuses to follow directions. I hope they didn’t take the same RLHF approach they did with that model.
by PilotJeff
1 subcomment
- More inflating of the bubble, with Anthropic essentially offering compute/LLMs below cost. Eventually the laws of physics/the market will take over, and look out below.
- Ok, the Victorian lock puzzle game is a pretty damn cool way to showcase the capabilities of these models. I kinda want to start building similar puzzle games for models to solve.
- Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?
- I've almost run out of Claude on the Web credits. If they announce that they're going to support Opus then I'm going to be sad :'(
- https://lifearchitect.ai/models-table/
by thot_experiment
1 subcomment
- It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.
It is emphatically not, it has never been; I have used both models extensively and I have never encountered a single situation where Sonnet did a better job than Opus. Any coding benchmark that has Sonnet above Opus is broken, or at the very least measuring things that are totally irrelevant to my use cases.
This in particular isn't my "oh, the teachers lie to you" moment that makes you distrust everything they say, but it really hammers the point home. I'm glad there's a cost drop, but at this point my assumption is that there's also going to be a quality drop until I can prove otherwise in real-world testing.
- What surprises me is that Opus 4.5 lost all the reasoning scores to Gemini and GPT. I thought that's the area where the model would shine the most.
- Does anyone have a benchmark that clearly distinguishes the larger models? I would think that the high parameter count models would have capabilities distinct from the smaller ones, that would easily be read out. For example, Opus 4 has apparently memorized many books. If you ask it just right (to get around the infuriating copyright controls), it will complete a paragraph from The Wealth of Nations or Aristotle’s Nicomachean Ethics in Ancient Greek. That cannot be possible on a smaller model that needs to compress more.
- This one is different. IYKYK...
by whitepoplar
1 subcomment
- Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?
by rishabhaiover
3 subcomments
- Is this available on claude-code?
by CuriouslyC
0 subcomments
- I hate on Anthropic a fair bit, but the cost reduction, quota increases and solid "focused" model approach are real wins. If they can get their infrastructure game solid, improve claude code performance consistency and maintain high levels of transparency I will officially have to start saying nice things about them.
by I_am_tiberius
1 subcomment
- Still mad at them because they decided not to take their users' privacy seriously. I'd be interested in how the new model behaves, but I just have a mental block and can't sign up again.
by throwaway2027
0 subcomments
- Oh that's why there were only 2 usage bars.
- This is great. Sonnet 4.5 has degraded terribly.
I can get some useful stuff from a clean context in the web UI, but the CLI is just useless.
Opus is far superior.
Today Sonnet 4.5 suggested verifying remote state file presence by creating an empty one locally and copying it to the remote backend.
Da fuq?
University level programmer my a$$.
And it seems like it has degraded this last month.
I keep getting braindead suggestions and code that looks like it came from a random word generator.
I swear it was not that awful a couple of months ago.
The Opus cap has been an issue; happy to change, and I really hope the nerf rumours are just that:
unfounded rumours, and that the degradation has a valid root cause.
But honestly sonnet 4.5 has started to act like a smoking pile of sh**t
by kachapopopow
0 subcomments
- Slightly better at React and spatial logic than Gemini 3 Pro, but slower and way more expensive.
- Great. Paying $100/month for Claude Code, this stops me from switching to Gemini 3.0 for now.
- Love the competition. Gemini 3 pro blew me away after being spoiled by Claude for coding things. Considered canceling my Anthropic sub but now I’m gonna hold on to it.
The bigger thing is Google has been investing in TPUs even before the craze. They're on what, gen 5 now? Gen 7? Anyway, I hope they keep investing tens of billions into it, because Nvidia needs to have some competition, and maybe if they do they'll stop this AI silliness and go back to making GPUs for gamers. (Hahaha, of course they won't. No gamer is paying 40k for a GPU.)
by GodelNumbering
1 subcomment
- The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.
by tschellenbach
0 subcomments
- Ok, but can it play Factorio?
by cyrusradfar
1 subcomment
- I'm curious if others are finding that there's a comfort in staying within the Claude ecosystem because when it makes a mistake, we get used to spotting the pattern. I'm finding that when I try new models, their "stupid" moments are more surprising and infuriating.
Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.
Am I alone here, are others finding themselves more forgiving of "their preferred" model provider?
- This is very impressive! As much as I love Claude though, is it just me, or is their limit much lower compared to others (Gemini and GPT)? At the moment I'm subscribed to Google One AI ($20), which gives me the most value with the 2TB Google Drive, and Cursor ($20). I've subscribed to GPT and Claude as well in the past; I found that I was hitting the limit much faster with Claude compared to all the others, which made me reluctant to subscribe again. From the blog post it seems like they've been prioritising the Max users most of the time?
- So are we in agreement that Claude is the thinking person's model and OpenAI is for the masses?
- That chart at the start is egregious.
- They lowered the price because this is a massive land grab and is basically winner take all.
I love that Anthropic is focused on coding. I've found their models to be significantly better at producing code similar to what I would write, meaning it's easy to debug and grok.
Gemini does weird stuff and while Codex is good, I prefer Sonnet 4.5 and Claude code.
- This is quite good.
- 80% vs 77% is not that much of a difference, lol.
- Got the river crossing one:
https://claude.ai/chat/0c583303-6d3e-47ae-97c9-085cefe14c21
Still fucked up the one about the boy and the surgeon though:
https://claude.ai/chat/d2c63190-059f-43ef-af3d-67e7ca1707a4
- The first chart is straight from "how to lie with charts"...