- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?
              SWE-bench  Aider  Cost ($/M out)  Speed (tok/s)  Cutoff
Claude 3.7    70%        65%    $15             77             8/24
Gemini 2.5    64%        69%    $10             200            1/25
GPT-4.1       55%        53%    $8              169            6/24
DeepSeek R1   49%        57%    $2.2            22             7/24
Grok 3 Beta   ?          53%    $15             ?              11/24
I'm not sure this is really an apples-to-apples comparison, as it may involve different test scaffolding and levels of "thinking". Tokens-per-second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o, given that the "latency" graph in the article puts them at the same latency.
Is it available in Cursor yet?
- telling the model to be persistent (+20%)
- don't self-inject/parse tool calls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
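Put together, a prompt skeleton following those tips might look something like this; the tag names and wording are my own, not from the cookbook:

```python
# Hypothetical prompt skeleton: persistence reminder, explicit planning,
# XML-style delimiters instead of JSON, and instructions repeated at both
# the top and the bottom of the context (bottom-only being the bad case).

def build_prompt(instructions: str, context_docs: list[str], user_query: str) -> str:
    docs = "\n".join(
        f"<doc id='{i}'>\n{doc}\n</doc>" for i, doc in enumerate(context_docs)
    )
    return f"""<instructions>
{instructions}
You are an agent: keep going until the task is fully resolved before yielding
back to the user. Plan extensively before each tool call, and reflect on the
outcome of previous calls.
</instructions>

<context>
{docs}
</context>

<user_query>
{user_query}
</user_query>

<instructions_reminder>
{instructions}
</instructions_reminder>"""
```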
My takeaways:
- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.
- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops too quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fail way too often as well, in my opinion.
- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.
- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.
My conclusion: I like it, and I totally see where it shines: narrow, targeted work, alongside Claude 3.7 for creative work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.
0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).
1. To win consumer growth, they have continued to benefit from hyper-viral moments; lately that was image generation in 4o, which was likely technically possible long before it launched.
2. For enterprise workloads and large API use, they seem to have focused less lately, but the pricing of 4.1 is clearly an answer to Gemini, which has been winning on ultra-high volume and consistency.
3. For full frontier benchmarks, they pushed out 4.5 to stay SOTA and attract the best researchers.
4. On top of all that, they had to, and did, quickly answer the reasoning promise and the DeepSeek threat with faster and cheaper o-series models.
They are still winning many of these battles but history highlights how hard multi front warfare is, at least for teams of humans.
I think it did very well - it's clearly good at instruction following.
Total token cost: 11,758 input, 2,743 output = 4.546 cents.
Same experiment run with GPT-4.1 mini: https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3... (0.8802 cents)
And GPT-4.1 nano: https://gist.github.com/simonw/1d19f034edf285a788245b7b08734... (0.2018 cents)
I found from my experience with Gemini models that after ~200k tokens the quality drops and the model basically doesn't keep track of things. But I don't have any numbers or a systematic study of this behavior.
I think all providers who announce increased max token limit should address that. Because I don't think it is useful to just say that max allowed tokens are 1M when you basically cannot use anything near that in practice.
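Something like the following needle-in-a-haystack probe is the kind of systematic check I'd want providers to publish alongside a context-length claim. It's a minimal sketch using the standard chat completions API; the filler text, needle, and context sizes are made up for illustration:

```python
# Hide one fact in progressively longer filler contexts and check whether
# the model can still retrieve it as the prompt approaches the 1M window.
from openai import OpenAI

client = OpenAI()
NEEDLE = "The maintenance password for server X7 is 'calico-stapler-42'."
FILLER = "The sky was a uniform grey and nothing of note happened. "

for target_chars in (50_000, 500_000, 2_000_000):  # very roughly 12k to 500k tokens
    haystack = (FILLER * (target_chars // len(FILLER) + 1))[:target_chars]
    midpoint = len(haystack) // 2
    document = haystack[:midpoint] + NEEDLE + haystack[midpoint:]

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": document + "\n\nWhat is the maintenance password for server X7?"},
        ],
    )
    answer = resp.choices[0].message.content or ""
    print(target_chars, "calico-stapler-42" in answer)
```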
I probably spend $100 a month on AI coding, and it's great at small, straightforward tasks.
Drop it into a larger codebase and it'll get confused due to context limits, even if the same tool built that codebase in the first place.
Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.
As opposed to Gemini 2.5 Pro having cutoff of Jan 2025.
Honestly this feels underwhelming and surprising. Especially if you're coding with frameworks with breaking changes, this can hurt you.
I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?
It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan. Changing plans isn't necessarily a bad thing, but I wonder why.
Did they not expect this model to turn out as well as it did?
[1] https://x.com/sama/status/1889755723078443244
[2] https://github.com/openai/openai-cookbook/blob/6a47d53c967a0...
• GPT-4.1-mini: balances performance, speed & cost
• GPT-4.1-nano: prioritizes throughput & low cost with streamlined capabilities
All share a 1 million-token context window (vs 128k on 4o and 200k on o1/o3-mini), excelling in instruction following, tool calls & coding.
Benchmarks vs prior models:
• AIME ’24: 48.1% vs 13.1% (~3.7× gain)
• MMLU: 90.2% vs 85.7% (+4.5 pp)
• Video‑MME: 72.0% vs 65.3% (+6.7 pp)
• SWE‑bench Verified: 54.6% vs 33.2% (+21.4 pp)
{"error":
{"message":"Quasar and Optimus were stealth models, and
revealed on April 14th as early testing versions of GPT 4.1.
Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}
For me, it was jaw dropping. Perhaps he didn't mean it the way it sounded, but seemed like a major shift to me.
Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al[1], and a bunch of boring technical infrastructure improvements sprinkled on top.
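To make that concrete, here's a toy sketch of a verifiable reward for code patches; the harness, scoring weights, and use of git apply/pytest are my own illustration, not anything OpenAI has described:

```python
# Toy "verifiable reward" for an RLVR-style stage: score a model's patch by
# whether it applies cleanly and whether the project's tests pass afterwards.
import shutil
import subprocess
import tempfile

def reward(repo_dir: str, patch: str) -> float:
    workdir = tempfile.mkdtemp()
    shutil.copytree(repo_dir, workdir, dirs_exist_ok=True)

    # 1) Diff-format check: does the patch even apply?
    apply = subprocess.run(
        ["git", "apply", "-"],
        input=patch, text=True, cwd=workdir, capture_output=True,
    )
    if apply.returncode != 0:
        return 0.0

    # 2) Functional check: do the unit tests pass?
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"], cwd=workdir, capture_output=True,
    )
    # Partial credit for a well-formed patch, full credit for green tests.
    return 1.0 if tests.returncode == 0 else 0.3
```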
>Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version
If anyone here doesn't know, OpenAI does offer the ChatGPT model version in the API as chatgpt-4o-latest, but it's a poor fit for businesses because it's continuously updated, so you can't rely on it staying stable. That's part of why OpenAI made GPT-4.1.
> You're eligible for free daily usage on traffic shared with OpenAI through April 30, 2025.
> Up to 1 million tokens per day across gpt-4.5-preview, gpt-4.1, gpt-4o and o1
> Up to 10 million tokens per day across gpt-4.1-mini, gpt-4.1-nano, gpt-4o-mini, o1-mini and o3-mini
> Usage beyond these limits, as well as usage for other models, will be billed at standard rates. Some limitations apply.
I just found this option in https://platform.openai.com/settings/organization/data-contr...
Is this just something I haven't noticed before? Or is this new?
The graphs presented don't even show a clear winner across all categories. The one with the biggest "number", GPT-4.5, isn't even the best in most categories; it's actually like 3rd in a lot of them.
This is quite confusing as a user.
Otherwise big fan of OAI products thus far. I keep paying $20/mo, they keep improving across the board.
> GPT‑4.5 Preview will be turned off in three months, on July 14, 2025
Not all systems upgrade every few months. A major question is when we reach step-improvements in performance warranting a re-eval, redesign of prompts, etc.
There's a small bleeding edge, and a much larger number of followers.
Tool use ability feels better than gemini-2.5-pro-exp [2], which sometimes struggles with JSON schema understanding (a sketch of the kind of tool schema I mean follows the links below).
Llama 4 has surprising agentic capabilities, better than both of them [3], but isn't as intelligent as the others.
[1] https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
[2] https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
[3] https://github.com/rusiaaman/chat.md/blob/main/samples/llama...
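For reference, this is roughly the shape of tool definition involved; the weather tool is made up, in the standard OpenAI-style function-calling format:

```python
# Example tool definition in the OpenAI-style function-calling format.
# The model must produce arguments that validate against the JSON Schema
# under "parameters" -- this is where weaker models tend to slip up.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    }
]
```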
It would be incredible to be able to feed an entire codebase into a model and say "add this feature" or "we're having a bug where X is happening, tell me why", but then you are limited by the output token length
As others have pointed out too, the more tokens you use, the less accuracy you get and the more it gets confused, I've noticed this too
We are a ways away yet from being able to input an entire codebase, and have it give you back an updated version of that codebase.
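A quick way to sanity-check whether a codebase even fits in the advertised window is to count tokens locally. This sketch assumes GPT-4.1 uses the same o200k_base tokenizer family as 4o (tiktoken may not know the 4.1 model names yet) and a hypothetical my_project directory:

```python
# Rough estimate of how many tokens a repo would consume as context.
import pathlib
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for 4o/4.1-era models
total = 0
for path in pathlib.Path("my_project").rglob("*.py"):  # hypothetical repo path
    total += len(enc.encode(path.read_text(errors="ignore")))

print(f"~{total:,} tokens of Python source")
# Remember to leave headroom for instructions and the model's output.
print("Fits in 1M window" if total < 1_000_000 else "Too big for a single request")
```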
- It's basically GPT4o level on average.
- More optimized for coding, but slightly inferior in other areas.
It seems to be a better model than 4o for coding tasks, but I'm not sure if it will replace the current leaders -- Gemini 2.5 Pro, o3-mini / o1, Claude 3.7/3.5.
Sam acknowledged this a few months ago, but with another release not really bringing any clarity, this is getting ridiculous now.
https://platform.openai.com/docs/models/gpt-4.1
As far as I can tell there's no way to discover the details of a model via the API right now.
Given the announced adoption of MCP and MCP's ability to perform model selection for Sampling based on a ranking for speed and intelligence, it would be great to have a model discovery endpoint that came with all the details on that page.
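The existing models list endpoint only returns minimal metadata, which illustrates the gap; with the current OpenAI Python SDK:

```python
# The existing /v1/models endpoint: you get IDs and ownership, but no
# context window, pricing, or capability details.
from openai import OpenAI

client = OpenAI()
for model in client.models.list():
    print(model.id, model.created, model.owned_by)
    # No fields here for max context, max output tokens, modalities, or price --
    # exactly the metadata an MCP-style model selector would want.
```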
Lies, damn lies and statistics ;-)
why would they deprecate when it's the better model? too expensive?
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months
Here's something I just don't understand: how can GPT-4.5 be worse than 4.1? Or is the only bad thing OpenAI's naming ability?
First it was that the models stopped putting in effort and felt lazy: tell one to do something and it will tell you to do it yourself. Now it's the opposite, and the models go ham changing everything they see; instead of changing one line, SOTA models would rather rewrite the whole project and still not fix the issue.
Two years back I totally thought these models were amazing. I would always test out the newest models and get hyped up about them. For every problem I had, I thought that if I just prompted it differently I could get it to solve it. Often I spent hours prompting, starting new chats, adding more context. Now I realize it's kind of useless, and it's better to just accept the models where they are rather than try to make them a one-stop shop or stretch their capabilities.
I think this release I won't even test it out; I'm not interested anymore. I'll probably just continue using DeepSeek free and Gemini free. I canceled my OpenAI subscription about 6 months ago, and canceled Claude after the 3.7 disappointment.
Wait, wouldn’t this be a decent test for reasoning ?
Every patch changes things, and there’s massive complexity with the various interactions between items, uniques, runes, and more.
https://platform.openai.com/docs/models/gpt-4.1
https://platform.openai.com/docs/models/gpt-4.1-mini
https://platform.openai.com/docs/models/gpt-4.1-nano
But the price is what matters.
- Broad Knowledge: 25.1
- Coder: Larger Problems: 25.1
- Coder: Line focused: 25.1
The lack of availability in ChatGPT is disappointing, and they're playing on ambiguity here. They are framing this as if it were unnecessary to release 4.1 on ChatGPT, since 4o is apparently great, while simultaneously showing how much better 4.1 is relative to GPT-4o.
One wager is that the inference cost is significantly higher for 4.1 than for 4o, and that they expect most ChatGPT users not to notice a marginal difference in output quality. API users, however, will notice. Alternatively, 4o might have been aggressively tuned to be conversational while 4.1 is more "neutral"? I wonder.
@sama: underrated tweet
Source: https://x.com/stevenheidel/status/1911833398588719274
4.1 is 26.6% better at coding than 4.5. Got it. Also…see the em dash
Prices below are per 1M tokens.
gpt-4.1
- Input: $2.00
- Cached Input: $0.50
- Output: $8.00
gpt-4.1-mini
- Input: $0.40
- Cached Input: $0.10
- Output: $1.60
gpt-4.1-nano
- Input: $0.10
- Cached Input: $0.025
- Output: $0.40
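As a worked example of what those rates mean in practice (it matches the 4.546-cent figure quoted earlier in the thread), assuming no cached input:

```python
# Cost of a single request at the published per-1M-token rates.
PRICES = {  # USD per 1M tokens (input, output)
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# The example from earlier in the thread: 11,758 input + 2,743 output tokens.
print(f"{cost_usd('gpt-4.1', 11_758, 2_743) * 100:.3f} cents")  # ~4.546 cents
```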
I don't understand the constant complaining about naming conventions. The number system differentiates the models based on capability; any other method would not do that. After ten models with random names like "gemini" or "nebula" you would have no idea which is which. It's a low-IQ take. You don't name new versions of software as if they were completely different software.
Also, yesterday, using v0, I replicated a full Next.js UI copying a major SaaS player. No backend integration, but the design and UX were stunning, and better than I could do if I tried. I have 15 years of backend experience at FAANG. Software will get automated, and it already is; people just haven't figured it out yet.
Gemini was drastically cheaper for image/video analysis, I'll have to see how 4.1 mini and nano compare.
All the solutions are already available on the internet on which various models are trained, albeit in various ratios.
Any variance could likely be due to the mix of the data.
GPT-4.1, GPT-4.1 mini, GPT-4.1 nano
I'll start with:
800 bn MoE (probably 120 bn activated), 200 bn MoE (33 bn activated), and 7 bn parameters for nano.
*Then fix all your prompts over the next two weeks.
> Qodo tested GPT‑4.1 head-to-head against other leading models [...] they found that GPT‑4.1 produced the better suggestion in 55% of cases
The linked blog post goes 404: https://www.qodo.ai/blog/benchmarked-gpt-4-1/
- Coding accuracy improved dramatically
- Handles 1M-token context reliably
- Much stronger instruction following
Well, that didn't last long.