One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute a solution that was only convoluted because of some sentence I had thrown into the instructions and completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles in a fast feedback loop. Codex is much worse at that because it takes like 5 minutes to validate that everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write a script to verify that what it did works (see the sketch below), and let it spin for 30-40 minutes.
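The "write a script to verify" part doesn't need to be fancy. Here's a minimal TypeScript sketch of the kind of check I mean; the npm scripts are placeholders for whatever your project actually treats as "done", nothing Codex-specific:

```ts
// verify.ts - minimal "keep this green" check (script names are placeholders).
import { execSync } from "node:child_process";

function run(cmd: string): boolean {
  try {
    execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
    return true;
  } catch {
    return false;
  }
}

// Whatever your project treats as "done": build, type check, tests, etc.
const steps = ["npm run build", "npm test"];
const failed = steps.filter((cmd) => !run(cmd));

if (failed.length > 0) {
  console.error(`Verification failed: ${failed.join(", ")}`);
  process.exit(1);
}
console.log("All checks passed.");
```

The agent just has to keep this exiting with 0, so anything it can't easily fake (tests, type checks, linters) works as a step.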
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
- As a general observation, Gemini is harder to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
Then I made the mistake of saying "run npm run build and fix all issues" (something I've run probably 50 times across codex and cc in the past 2 months). CC does it pretty much 100% of the time. I walked away from Codex, and when I came back, it had installed 2 new node packages and gone down some crazy rabbit hole with eslint and something else. (This was for 2 minor TypeScript errors.)
After I reverted all its changes, I had CC do it, and it fixed it in about 30-60 seconds.
I'll try a few more times. Let's see.
Claude: they barely have a sign-in system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
I've vibe coded Godot games extensively.
Just about every model I've tried likes to invent imaginary functions.
I would really prefer there to be a way for me to pick a model trained on whatever framework I need.
Reviewing AI generated code feels like editing a long book, and every now and then you notice some words are just completely made up. You then ask the AI to fix its book, and it will just add more AI generated words.
On one hand I want this to be a reality check to everyone who's trying to lay off real software engineers to replace us with AI.
On the other hand half of the stock market is held up by overhyped AI valuations. If the tide goes out too fast, and there is a mass realization that this stuff just isn't as good as it's hyped to be, it's not going to be fun for anyone.
It sounded like Gemini 3 would be that, but in my limited testing it didn't appear to be.
Going to wait and see after being burned by 5.1 before I upgrade back to 0.58.
Gemini 3 has been a letdown tbh, seeing that agentic coding wasn't a top priority. I'm sticking with Codex for now and using Gemini 3 for frontend.
Currently, I either need a fast agent that does what I want faster than I can type it (CRUD, forms, etc.), or an agent to discuss a plan and its ups and downs.
Whenever I try to give it a bigger task, it takes a lot of time and the result often isn’t what I expected, which might be totally my fault or context specific. But as soon as I’m able to define the task properly, I would prefer a faster model, since it will be good enough - just faster. I really don’t have problems anymore that I can’t reasonably solve fast enough with this approach.
I’ve run multiple concurrent GPT-5 Codex sessions in the cloud, but I didn’t accept a single thing they did.
In the end, thinking it through, reading, hacking - boom - is faster than outsourcing the work for 30 minutes, plus 30 minutes to digest, plus 30 minutes to change.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
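For context, "compaction" here usually means managing the transcript at the application/token layer: when the history approaches the context limit, older turns get replaced by a summary, rather than relying on attention inside the model, which still has to fit everything into a finite window. A rough sketch of the general idea; the thresholds, token estimate, and summarizer below are illustrative placeholders, not OpenAI's actual implementation:

```ts
// Rough sketch of token-layer compaction (all numbers and helpers are illustrative).
type Message = { role: "system" | "user" | "assistant"; content: string };

const MAX_TOKENS = 128_000; // hypothetical context budget
const KEEP_RECENT = 20;     // most recent turns kept verbatim

// Crude estimate; a real system would use the model's tokenizer.
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

// Placeholder summarizer; in practice this would itself be a model call.
async function summarize(msgs: Message[]): Promise<string> {
  return msgs.map((m) => `${m.role}: ${m.content.slice(0, 80)}`).join("\n");
}

async function compact(history: Message[]): Promise<Message[]> {
  if (history.length <= KEEP_RECENT || estimateTokens(history) < MAX_TOKENS) {
    return history; // still fits; nothing to do
  }
  const older = history.slice(0, -KEEP_RECENT);
  const recent = history.slice(-KEEP_RECENT);
  const summary = await summarize(older);
  // Older turns collapse into one summary message, freeing budget for new work
  // while keeping a lossy record of what already happened.
  return [{ role: "system", content: `Summary of earlier work:\n${summary}` }, ...recent];
}
```

Roughly: attention decides what to weight within the window; compaction is about what even stays in the window once the transcript outgrows it.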
I gave it a shot last month, but I did not enjoy it due to the lack of a proper planning mode and of a way to accept each edit independently. Has it improved?
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems really weird. I’m not sure how exactly it works, but:
1. It uses very few tokens / fills up the context slowly (good, I guess).
2. It doesn’t seem to actually internalize the contents of the files you mention to it, or the ones it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to the codebase's patterns and architecture, while Codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
Curious if anyone else is trying agent orchestration beyond the editor itself?
"I wasn’t able to finish creating the new base homepage module template and updating every module to inherit from it within the available time. I did not make any changes or commits."
Told it to get back to work. Let's see how that goes.
It would be even more interesting to see how Sonnet and Haiku compare with that curve.