I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. And it's often faster, too, to generate and run the script than to have the LLM process large SQL files.
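As a sketch of what such a generated script might look like (the schema representation and the redundancy rule here are hypothetical, a real script would read the schema from `information_schema` or a dump):

```python
# Flag redundant indexes: an index is considered redundant if its column
# list is a strict prefix of another index's column list on the same table.
def find_redundant_indexes(indexes):
    """indexes: dict mapping table name -> {index_name: (col, ...)}"""
    redundant = []
    for table, idx in indexes.items():
        for name_a, cols_a in idx.items():
            for name_b, cols_b in idx.items():
                if name_a != name_b and cols_b[:len(cols_a)] == cols_a:
                    redundant.append((table, name_a, name_b))
    return redundant

# Hypothetical schema: idx_customer is covered by idx_customer_date.
schema = {
    "orders": {
        "idx_customer": ("customer_id",),
        "idx_customer_date": ("customer_id", "created_at"),
    }
}
print(find_redundant_indexes(schema))
```

The naming-convention check would be a similar loop with a regex over index names.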
This is important information for anyone who thinks these systems are thinking, reasoning, and learning from interactions, or that they're having a conversation with them, i.e. 90% of LLM users.
This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
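The script the model writes for the parenthesis question is a trivial depth scan (a minimal sketch):

```python
def is_balanced(s):
    """Return True if every ')' matches an earlier '(' and none are left open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' so far
                return False
    return depth == 0

print(is_balanced("((())))"))  # 3 opens, 4 closes -> False
```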
It would be interesting to actively track how far along each progressive model gets...
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown:
Opening parentheses (: 3
Closing parentheses ): 4
... following up with:
> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown:
Opening parentheses (: 5
Closing parentheses ): 5
... and uses ~5,000 tokens to get the wrong answer.
You wouldn't say that a human who doesn't know how to read is unreliable at everything, just at reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10 it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
“Model can count to 5”… tick.
“Model can count to 10”… sorry you gotta wait til 2028.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
{
"model": "gpt-5.2-2025-12-11",
"instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
"input": "((((())))))",
"temperature": 0
}
> Lower reasoning effort
The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens. (It only uses them if you ask for reasoning specifically in the prompt.) What is the point of such a paper? We already know that reasoning tokens are necessary.
Edit: I actually ran the prompt and this was the response
{
"model": "gpt-5.2-2025-12-11",
"output_text": "Yes",
"reasoning": {
"effort": "none",
"summary": null
},
"usage": {
"input_tokens": 26,
"output_tokens": 5,
"total_tokens": 31,
"output_tokens_details": {
"reasoning_tokens": 0
}
}
}
So reasoning_tokens used was zero. This whole paper is kinda useless and misleading. Did this get peer reviewed or something?
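For reference, a two-line count shows the string from the request above is in fact unbalanced, so the zero-reasoning "Yes" is wrong (a quick sanity check, not something from the paper):

```python
s = "((((())))))"          # the "input" string from the request above
print(s.count("("), s.count(")"))  # 5 opens vs 6 closes -> not balanced
```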
Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).