- If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
- Since they are not showing you how this model compares to its competitors on the benchmarks they cite, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:
SWE-Bench (Pro / Verified)
Model               | Pro (%) | Verified (%)
--------------------+---------+--------------
GPT-5.2-Codex       | 56.4    | ~80
GPT-5.2             | 55.6    | ~80
Claude Opus 4.5     | n/a     | ~80.9
Gemini 3 Pro        | n/a     | ~76.2
And for terminal workflows, where agentic steps matter: Terminal-Bench 2.0
Model               | Score (%)
--------------------+-----------
Claude Opus 4.5     | ~60+
Gemini 3 Pro        | ~54
GPT-5.2-Codex       | ~47
So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:
- Claude is still ahead on strict coding + terminal-style tasks
- Gemini is better for huge context + multimodal reasoning
- GPT-5.2-Codex is strong but not clearly the new state of the art across the board
It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.
- I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging things.
One surprising thing Codex helped with is procrastination. I'm sure many people know the feeling: you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.
- The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is clearer and code that is more maintainable. A pattern I use: set up a GitHub issue with Claude plan mode, then have Codex execute it. Then come back to Claude to run custom code review plugins. Then, of course, review it with my own eyes before merging the PR.
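For concreteness, the glue between the steps is mostly just shelling out to the three CLIs. A rough sketch of the issue-to-PR loop in Python (the `gh issue view`, `codex exec`, and `claude -p` invocations are from memory, so treat the exact flags and the issue number as assumptions, not gospel):

    import subprocess

    def run(args: list[str]) -> str:
        """Run a command and return its stdout, raising on a non-zero exit."""
        return subprocess.run(args, check=True, capture_output=True, text=True).stdout

    def issue_to_pr(issue_number: int) -> None:
        # 1. Pull the plan that Claude plan mode wrote into the GitHub issue.
        plan = run(["gh", "issue", "view", str(issue_number), "--json", "title,body"])

        # 2. Hand the plan to Codex non-interactively and let it implement it.
        run(["codex", "exec", f"Implement the plan in this GitHub issue:\n{plan}"])

        # 3. Have Claude review the resulting diff before I read it myself.
        diff = run(["git", "diff"])
        print(run(["claude", "-p", f"Review this diff for bugs and regressions:\n{diff}"]))

    if __name__ == "__main__":
        issue_to_pr(123)  # hypothetical issue number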
My only gripe is I wish they'd publish Codex CLI updates to Homebrew at the same time as npm :)
by kordlessagain
3 subcomments
- I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex securely (or not) in a container to test the model or build workflows, check out https://github.com/DeepBlueDynamics/codex-container.
It ships with 300+ MCP tools (crawl, Google search, Gmail/GCal/GDrive, Slack, scheduling, web indexing, embeddings, transcription, and more). Many came from tools I originally built for Claude Desktop—OpenAI’s MCP has been stable across 20+ versions so I prefer it.
I will note I usually run this in Danger mode, but because it runs in a container it doesn't have access to ENVs I don't want it messing with, and I keep it in a directory I'm OK with it changing or poking about in.
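To make the isolation part concrete: the container only sees an allow-listed set of env vars and a single mounted project directory, so even in Danger mode the blast radius stays small. A minimal sketch of that pattern with plain Docker (this is not the codex-container interface; the image name is made up and the danger-mode flag is from memory):

    import subprocess

    def run_codex_sandboxed(workdir: str, prompt: str) -> None:
        """Run Codex with approvals bypassed, but inside a container that only
        sees allow-listed env vars and one mounted project directory."""
        allowed_env = ["OPENAI_API_KEY"]            # everything else stays on the host
        env_flags = []
        for name in allowed_env:
            env_flags += ["--env", name]            # pass through by name, value taken from host

        subprocess.run(
            ["docker", "run", "--rm", "-it",
             *env_flags,
             "--volume", f"{workdir}:/workspace",   # the only host path it can touch
             "--workdir", "/workspace",
             "codex-container:latest",              # hypothetical image with the Codex CLI installed
             "codex", "exec", "--dangerously-bypass-approvals-and-sandbox", prompt],
            check=True,
        )

    # Example: only ever point it at a scratch checkout.
    # run_codex_sandboxed("/home/me/scratch/myproject", "Fix the failing tests")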
Headless browser setup for the crawl tools: https://github.com/DeepBlueDynamics/gnosis-crawl.
My email is in my profile if anyone needs help.
- I suspect there are shills astroturfing every LLM release. Or people are overreacting out of unnecessary attachment.
by freedomben
7 subcomments
- The cybersecurity angle is interesting, because in my experience OpenAI's models have gotten terrible at cybersecurity: they simply refuse to do anything that could be remotely offensive (as in the opposite of "defensive"). I really thought we as an industry had learned our lesson that blocking "good guys" (aka white-hats) from offensive tools/capabilities only empowers the gray-hats/black-hats and puts us at a disadvantage. A good defense requires some offense. I sure hope they change that.
- It's interesting that they're foregrounding "cyber" stuff (basically: applied software security testing) this way, but I think we've already crossed a threshold of utility for security work that doesn't require models to advance to make a dent --- and won't be responsive to "responsible use" controls. Zero-shotting is a fun stunt, but in the real world what you need is just hypothesis identification (something the last few generations of models are fine at) and then quick building of tooling.
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
- Fascinating to see the increasing acceptance of AI-generated code in HN comments.
We've come a long way since gpt-3.5, and it's rewarding to see people who are willing to change their cached responses.
- Somehow Codex for me is always way worse than the base models.
Especially in the CLI, it seems so eager to start writing code that nothing can stop it, not even the best Agents.md.
Asking it a question or telling it to check something doesn't mean it should start editing code; it means answer the question. All models have this issue to some degree, but Codex is the worst offender for me.
by simianwords
1 subcomment
- No one's saying this, but this is around 40% costlier than the previous Codex model. The price change is important.
by NitpickLawyer
0 subcomments
- > In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety.
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (e.g. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (e.g. our gpt666-pro-ultra-krypto-sec found a CVE in an OpenBSD stable release), while not being exposed to tabloid-style headlines like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
- Can anyone elaborate on what they're referring to here?
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
- Codex code review has been astounding for my distributed team of devs. Very well spent money.
- Would love to see some comparison numbers against Gemini and Claude, especially given this claim:
"The most advanced agentic coding model for professional software engineers"
- GPT 5.1 has been pure magic in VSCode via the Codex plugin. I can't tell any difference with 5.2 yet. I hope the Codex plugin gets feature parity with CC, Cursor, Kilo Code etc soon. That should increase performance a bit more through scaffolding.
I had assumed OpenAI was irrelevant, but 5.1 has been so much better than Gemini.
by postalcoder
4 subcomments
- It has very quickly become unfashionable for people to say they like the Codex CLI. I still enjoy working with it, and my only complaint is that its speed makes it less than ideal for pair coding.
On top of that, the Codex CLI team is responsive on GitHub, and it's clear that user complaints make their way to the team responsible for fine-tuning these models.
I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which has trouble with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.
- We have made this model even better at programming on Windows. Give it a shot :)
- lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it
- I've been doing some reverse engineering recently and have found Gemini 3 Pro to be the best model for that, surprisingly much better than Opus 4.5. Maybe it's time to give Codex a try
- Why aren’t they making gpt-5.2-codex available in the API at launch?
- Constantly disconnects in VS Code extension. Have to switch to regular 5.2...
- My only concern with Codex is that it's not possible to delete tasks.
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
- I'm glad we are moving towards quality over speed.
- > <PLACEHOLDER FOR FRONTEND HTML ASSETS>
> [ADD/LINK TO ROLLOUT THAT DISCOVERED VULNERABILITY]
What’s up with these in the article?
by OldGreenYodaGPT
0 subcomments
- GPT 5.2 has been very good in Codex; can't wait to try this new model. Will see how it compares to Opus 4.5.
by jasonthorsness
0 subcomments
- Recently I've had the best results with Gemini; with this I'll have to go back to Codex for my next project. It takes time to get a feel for the capabilities of a model, so it's sort of tedious having new ones come out so frequently.
by fellowniusmonk
1 subcomment
- In all my unpublished tests, which focus on 1. unique logic puzzles that are intentionally adjacent to existing puzzles and 2. implementing a specific, not particularly common CRDT algorithm that has an official reference implementation on GitHub (so the models have definitely been trained on it), I find that 5.2 overfits to the more common implementation and will actively break working code and puzzles.
I find it incorrectly pattern-matching with a very narrow focus; it will ignore real, documented differences even when they are explicitly highlighted in the prompt text (this is X CRDT algo, not Y CRDT algo).
I've canceled my subscription. The idea that on any larger edit it will just start wrecking nuance, and then refuse to accept prompts that point this out, is an extremely dangerous form of target fixation.
- I hope this makes a big jump forward for them. I used to be a heavy Codex user, but it has just been so much worse than Claude Code both in UX and in actual results that I've completely given up on it. Anthropic needs a real competitor to keep them motivated and they just don't have one right now, so I'd really like to see OpenAI get back in the game.
by ChrisMarshallNY
0 subcomments
- > For example, just last week, a security researcher using GPT‑5.1-Codex-Max with Codex CLI found and responsibly disclosed (opens in a new window) a vulnerability in React that could lead to source code exposure.
Translation: "Hey y'all! Get ready for a tsunami of AI-generated CVEs!"
- Fwiw, I had some well-defined tickets in Jira assigned to me, and 5.2 absolutely crushed them. Still waiting on CI, but it's game over.
- The models aren't smart enough to be fully agentic. This is why Claude Code's human-in-the-loop process is 100x more ergonomic.
by phplovesong
0 subcomments
- But when will they release SLOP-1.7-Jizz
- Very minuscule improvement. I suspect GPT 5.2 is already a coding model from the ground up, and this Codex model just adds "various optimizations + tools" on top.
- They found one React bug and spent pages on "frontier" "cyber" nonsense. They make these truly marvelous models available only to "vetted" "security professionals".
I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.
EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.
- So, uh, I've been an idiot and running it in yolo mode, and twice now it's gone and deleted the entire project directory, wiping out all of my work. Thankfully I have backups, and it's my fault for playing with fire, but yeesh.
I have https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8... as a guard against that, for anyone who, like me, is stupid enough to run it in yolo mode and wants to copy it.
Codex also has command-line options that let you specifically prohibit running rm in bash, so look those up too.
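If you just want the shape of the idea without the gist, the guard boils down to refusing obviously destructive commands before they ever run. A toy Python version (this is not the linked gist, and the patterns are nowhere near complete):

    import re
    import shlex
    import subprocess
    import sys

    # Commands that should never run unattended against the project tree.
    DESTRUCTIVE = [
        re.compile(r"^rm\s+(-\S*[rf]\S*\s+)"),        # rm -rf / rm -r / rm -f ...
        re.compile(r"^git\s+clean\b.*-[a-zA-Z]*f"),   # git clean -fd and friends
        re.compile(r"^find\b.*\s-delete\b"),          # find ... -delete
    ]

    def guard(cmd: str) -> None:
        """Run a shell command only if it doesn't match a destructive pattern."""
        if any(p.search(cmd.strip()) for p in DESTRUCTIVE):
            sys.exit(f"blocked destructive command: {cmd!r}")
        subprocess.run(cmd, shell=True, check=True)

    if __name__ == "__main__":
        guard(shlex.join(sys.argv[1:]))   # e.g. python guard.py rm -rf build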
by mistercheph
1 subcomment
- Gotta love only comparing the model to other OpenAI models, and just like yesterday's Gemini thread, the vibes in this thread are so astroturfed. I guess it makes sense for the frontier labs to want to win the hearts and minds of Silicon Valley.
- Thank gosh, we have so much bloody competition.
The models are so good, unbelievably good. And getting better weekly, including on pricing.
- I actually have 0 enthusiasm for this model. When GPT 5 came out it was clearly the best model, but since Opus 4.5, GPT5.x just feels so slow. So, I am going to skip all `thinking` releases from OpenAI and check them again only if they come up with something that does not rely so much on thinking.