One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute a solution that was only convoluted because of some sentence I had thrown into the instructions and completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles in a fast feedback loop. Codex is much worse at that because it takes like 5 minutes to validate that everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write a script to verify that what it did works (see the sketch below), and let it spin for 30-40 minutes.
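The "write a script to verify" part doesn't need to be fancy. Here's a minimal TypeScript sketch of the kind of check I mean; the npm scripts are placeholders for whatever your project actually treats as "done", nothing Codex-specific:

```ts
// verify.ts - minimal "keep this green" check (script names are placeholders).
import { execSync } from "node:child_process";

function run(cmd: string): boolean {
  try {
    execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
    return true;
  } catch {
    return false;
  }
}

// Whatever your project treats as "done": build, type check, tests, etc.
const steps = ["npm run build", "npm test"];
const failed = steps.filter((cmd) => !run(cmd));

if (failed.length > 0) {
  console.error(`Verification failed: ${failed.join(", ")}`);
  process.exit(1);
}
console.log("All checks passed.");
```

The agent just has to keep this exiting with 0, so anything it can't easily fake (tests, type checks, linters) works as a step.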
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
- As a general observation, Gemini is harder to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
Then I made the mistake of saying "run npm run build and fix all issues" (something I've run probably 50 times across codex and cc in the past 2 months). CC does it pretty much 100% of the time. I walked away from Codex, and when I came back, it had installed 2 new node packages and gone down some crazy rabbit hole with eslint and something else. (This was for 2 minor TypeScript errors.)
After I reverted all its changes, I had CC do it, and it fixed it in about 30-60 seconds.
I'll try a few more times. Let's see.
Claude: they barely have a sign-in system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
I've vibe coded Godot games extensively.
Just about every model I've tried likes to invent imaginary functions.
I would really prefer there to be a way for me to pick a model trained on whatever framework I need.
Reviewing AI generated code feels like editing a long book, and every now and then you notice some words are just completely made up. You then ask the AI to fix its book, and it will just add more AI generated words.
On one hand I want this to be a reality check to everyone who's trying to lay off real software engineers to replace us with AI.
On the other hand half of the stock market is held up by overhyped AI valuations. If the tide goes out too fast, and there is a mass realization that this stuff just isn't as good as it's hyped to be, it's not going to be fun for anyone.
It sounded like Gemini 3 would be that, but in my limited testing it didn't appear to be.
Going to wait and see after being burned by 5.1 before I upgrade back to 0.58.
Gemini 3 has been a letdown tbh, seeing that agentic coding wasn't a top priority. I'm sticking with Codex for now and using Gemini 3 for frontend.
Currently, I either need a fast agent that does what I want faster than I can type it (CRUD, forms, etc.), or an agent to discuss a plan and its ups and downs.
Whenever I try to give it a bigger task, it takes a lot of time and the result often isn’t what I expected, which might be totally my fault or context specific. But as soon as I’m able to define the task properly, I would prefer a faster model, since it will be good enough - just faster. I really don’t have problems anymore that I can’t reasonably solve fast enough with this approach.
I’ve run multiple concurrent GPT-5 Codex sessions in the cloud, but I didn’t accept a single thing they did.
In the end, thinking it through, reading, hacking - boom - is faster than outsourcing the work for 30 minutes, plus 30 minutes to digest, plus 30 minutes to change.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
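For context, "compaction" here usually means managing the transcript at the application/token layer: when the history approaches the context limit, older turns get replaced by a summary, rather than relying on attention inside the model, which still has to fit everything into a finite window. A rough sketch of the general idea; the thresholds, token estimate, and summarizer below are illustrative placeholders, not OpenAI's actual implementation:

```ts
// Rough sketch of token-layer compaction (all numbers and helpers are illustrative).
type Message = { role: "system" | "user" | "assistant"; content: string };

const MAX_TOKENS = 128_000; // hypothetical context budget
const KEEP_RECENT = 20;     // most recent turns kept verbatim

// Crude estimate; a real system would use the model's tokenizer.
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

// Placeholder summarizer; in practice this would itself be a model call.
async function summarize(msgs: Message[]): Promise<string> {
  return msgs.map((m) => `${m.role}: ${m.content.slice(0, 80)}`).join("\n");
}

async function compact(history: Message[]): Promise<Message[]> {
  if (history.length <= KEEP_RECENT || estimateTokens(history) < MAX_TOKENS) {
    return history; // still fits; nothing to do
  }
  const older = history.slice(0, -KEEP_RECENT);
  const recent = history.slice(-KEEP_RECENT);
  const summary = await summarize(older);
  // Older turns collapse into one summary message, freeing budget for new work
  // while keeping a lossy record of what already happened.
  return [{ role: "system", content: `Summary of earlier work:\n${summary}` }, ...recent];
}
```

Roughly: attention decides what to weight within the window; compaction is about what even stays in the window once the transcript outgrows it.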
I gave it a shot last month, but I did not enjoy it due to the lack of a proper planning mode and of a way to accept each edit independently. Has it improved?
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems really weird. I’m not sure how exactly it works, but:
1. It uses very few tokens / fills up the context slowly (good, I guess).
2. It doesn’t seem to actually internalize the contents of the files you mention to it, or the ones it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to the codebase's patterns and architecture, while Codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
Curious if anyone else is trying agent orchestration beyond the editor itself?
"I wasn’t able to finish creating the new base homepage module template and updating every module to inherit from it within the available time. I did not make any changes or commits."
Told it to get back to work. Let's see how that goes.
It would be even more interesting to see how Sonnet and Haiku compare with that curve.