FRESH

Hacker News

Home

Reasoning is not model improvement

52 points by QueensGambit

by simonw

3 subcomments

There are several key points in this that I don't think are accurate.
> When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result.
That's not true of the model itself, see my comment here which demonstrates it multiplying two large numbers via the OpenAI API without using Python: https://news.ycombinator.com/item?id=45683113#45686295
On GPT-5 it says:
> What they delivered barely moved the needle on code generation, the one capability that everything else depends on.
I don't think that holds up. GPT-5 is wildly better at coding that GPT-4o was (and got even better with GPT-5-Codex). A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.
From the conclusion:
> All [AI coding startups] betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards.
The models really don't need to get better at generating code right now for the economic impact to be profound. If progress froze today we could still spend the next 12+ months finding new ways to get better results for code out of our current batch of models.

by daxfohl

3 subcomments

I've seen research that shows that starting with reasoning models, and fine-tuning to slowly remove the reasoning steps, allows you to bake the reasoning directly into the model weights in a strong sense. Here's a recent example, and you can see the digits get baked into a pentagonal prism in the weights, allowing accurate multi-digit multiplication without needing notes: https://arxiv.org/abs/2510.00184. So, reasoning and tool use could be the first step, to collect a ton of training data to do something like this fine-tuning process.

by Terr_

1 subcomments

> These are not model improvements. They're engineering workarounds for models that stopped improving.
One might characterize it as an improvement in the document-style which the model operates upon.
My favorite barely-a-metaphor is that the "AI" interaction is based on a hidden document that looks like a theater script, where characters User and Bot are having a discussion. Periodically, the make_document_longer(doc) function (the stateless LLM) is invoked to to complete more Bot lines. An orchestration layer performs the Bot lines towards the (real) user, and transcribes the (real) user's submissions into User dialogue.
Recent improvements? Still a theater-script, but:
1. Reasoning - The Bot character is a film-noir detective with a constant internal commentary, not typically "spoken" to the User character and thus not "performed" by the orchestration layer: "The case was trouble, but I needed to make rent, and to do that I had to remember it was Georgia the state, not the country."
2. Tools - There are more stage-directions, such as "Bot uses [CALCULATOR] inputting [sqrt(5)*pi] and getting [PASTE_RESULT_HERE]". Regular programs are written to parse the script, run tools, and then replace the result.
Meanwhile, the fundamental architecture and the make_document_longer(doc) haven't changed as much, hence the author's title of "not model improvement."*

by QueensGambit

10 subcomments

Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.
1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?
3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?
Would love to know your thoughts!

by sam0x17

1 subcomments

If you subscribe to extended mind theory and Merleau Ponty's brand of phenomenology, tools are just an extension of your cognitive process, and "shelling out" in this way is really to be expected of high intelligence, if not consciousness. Some would say it might even be a prerequisite for consciousness, that you need to be a being-in-the-world etc etc
Not to say that GPT is conscious, in its current form I think it certainly isn't, but rather I would say reasoning is a positive development, not an embarrassing one
I can't compute 297298*248 immediately in my head, and if I were to try it I'd have to hobble through a multiplicaion algorithm, in my head... it's quite simlar to what they're doing here, it's just they can wire it right into a real calculator instead of slowly running a shitty algo on wetware

by cat-whisperer

1 subcomments

I use thinking with claude cod extensively. It’s same as reasoning they just name it differently. It definitely helps, sometimes it feels like they came up with original thoughts and ideas

by gardnr

0 subcomment

I appreciate the insight in this post and I would argue that we need the right amount of loss to enable the next level. Trying to achieve "lossless representation" will be very expensive and may not bring the desired outcome. Humans build neurochemical bias networks specifically so we can discard information and still operate in diverse and dangerous environments.

by Legend2440

0 subcomment

This article is pretty much all wrong. Reasoning is not tool calling.
Reasoning is about working through problems step-by-step. This is always going to be necessary for some problems (logic solving, puzzles, etc) because they have a known minimum time complexity and fundamentally require many steps of computation.
Bigger models = more width to store more information. Reasoning models = more depth to apply more computation.

by toobulkeh

0 subcomment

This is just pedantic noise. "Don't improve the model by using code gen" - problem "Write better architecture" - solution
It's literally the same thing. Sure, OpenAI's branding of ChatGPT as a product with GPT-5 is confusing, because GPT-5 is both a MODEL and a PRODUCT (collection of models, including GPT-5).
But does it matter?

by ohadron

0 subcomment

> Each step moves further from "how do we build better models?" toward "how do we monetize the models we have?"
I don't think OpenAI launching ChatGPT Apps and Atlas signals they're pivoting.
It's just that when you raise that much money you must deploy it in any possible direction.

by skybrian

0 subcomment

The combination of LLM's and various other tools seems pretty powerful and it's the way industry is moving. Many combinations will be tried and whatever patterns work will be copied. LLM's and tools are co-evolving.
I'm not sure why we should be dissatisfied with that?

by xeckr

1 subcomments

For what it's worth, the brain also doesn't use its language centres to do arithmetic.

by raincole

1 subcomments

It makes so little sense that it's hard to comment.
> Unlike GPT-3, which at least attempted arithmetic internally (and often failed), o1 explicitly delegates computation to external tools.
How is it a bad thing? Does the author really believe this is a bad thing?
Even if we believe tech bros' most wild claim - AGI is around the corner - I still don't know why calling external tools makes an AGI less AGI.
If you ask Terence Tao what 113256289421x89831475287 is I'm quite sure he'd "call external tools." Does it make him less a mathematician?
Plus, this is not what people call "reasoning." The title:
> Reasoning Is Not Model Improvement
The content:
> (opening with how o1 is calling external tools for arithmetic)
...anyway, whatever. I guess it's a Cunningham's Law thing. Otherwise it's a bit puzzling why someone knows nothing about a topic had to write an article to make everyone know how clueless they are.

by meowface

0 subcomment

Almost everything in this seems incorrect.

by fudged71

0 subcomment

Huh? Integrating external tool calls for routine, deterministic operations in large language models is not a fallback or workaround but a deliberate, intelligent architectural choice that mirrors human adaptive expertise by combining efficient routine processing with flexible, innovative reasoning to handle novel challenges

by bossyTeacher

0 subcomment

Not sure why people in the comments are trying to explain LLM behaviour using brain behaviour. We have been here before. Remember the brain heat engine analogies.

by pessimizer

0 subcomment

I just think that LLM calls are the new transistor. Transistors don't do much, but you build computers out of them. LLMs do a lot more than transistors.
LLMs are very good at imitating moderate-length patterns. It can usually keep an apparently sensible conversation going with itself for at least a couple thousand words before it goes completely off the rails, although you never know exactly when it will go off the rails; it's very unlikely to be after the first sentence, far more likely to be after the twenty-first, and will never get past the 50th. If you inject novel input in periodically (such as reminding and clarifying prompts), you can keep the plate spinning longer.
So some tricks work right now to extend the amount of time the thing can go before falling into the inevitable entropy that comes from talking to itself too long, and I don't think that we should assume that there won't ever be a way to keep the plate spinning forever. We may be able to do it practically (making it very unusual for them to fall apart), or somebody may come up with a way to make them provably resilient.
I don't know if the current market leaders have any insight into how to do this, however. But I'm also sure that an LLM reaching for a calculator and injecting the correct answer into the context keeps that context useful for longer than if it hadn't.

by j45

0 subcomment

I wonder if the industry should be calling it simulated probabilistic reasoning vs the reasoning that deterministic software is able to reliably.