I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. And it's often faster, too, to generate and run the script than to have the LLM process large SQL files.
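As a sketch of what such a generated script might look like (the schema representation and the redundancy rule here are hypothetical, a real script would read the schema from `information_schema` or a dump):

```python
# Flag redundant indexes: an index is considered redundant if its column
# list is a strict prefix of another index's column list on the same table.
def find_redundant_indexes(indexes):
    """indexes: dict mapping table name -> {index_name: (col, ...)}"""
    redundant = []
    for table, idx in indexes.items():
        for name_a, cols_a in idx.items():
            for name_b, cols_b in idx.items():
                if name_a != name_b and cols_b[:len(cols_a)] == cols_a:
                    redundant.append((table, name_a, name_b))
    return redundant

# Hypothetical schema: idx_customer is covered by idx_customer_date.
schema = {
    "orders": {
        "idx_customer": ("customer_id",),
        "idx_customer_date": ("customer_id", "created_at"),
    }
}
print(find_redundant_indexes(schema))
```

The naming-convention check would be a similar loop with a regex over index names.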
This is important information for anyone who thinks these systems are thinking, reasoning, and learning from interactions, or that they're having a conversation with them, i.e. 90% of LLM users.
This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
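The script the model writes for the parenthesis question is a trivial depth scan (a minimal sketch):

```python
def is_balanced(s):
    """Return True if every ')' matches an earlier '(' and none are left open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' so far
                return False
    return depth == 0

print(is_balanced("((())))"))  # 3 opens, 4 closes -> False
```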
It would be interesting to actively track how far along each progressive model gets...
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown:
Opening parentheses (: 3
Closing parentheses ): 4
... following up with:
> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown:
Opening parentheses (: 5
Closing parentheses ): 5
... and uses ~5,000 tokens to get the wrong answer.
You wouldn't say that a human who doesn't know how to read is unreliable at everything, just at reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10 it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
“Model can count to 5”… tick.
“Model can count to 10”… sorry you gotta wait til 2028.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
{
"model": "gpt-5.2-2025-12-11",
"instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
"input": "((((())))))",
"temperature": 0
}
> Lower reasoning effort
The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens. (It only uses them if you ask for reasoning specifically in the prompt.) What is the point of such a paper? We already know that reasoning tokens are necessary.
Edit: I actually ran the prompt and this was the response
{
"model": "gpt-5.2-2025-12-11",
"output_text": "Yes",
"reasoning": {
"effort": "none",
"summary": null
},
"usage": {
"input_tokens": 26,
"output_tokens": 5,
"total_tokens": 31,
"output_tokens_details": {
"reasoning_tokens": 0
}
}
}
So reasoning_tokens used was zero. This whole paper is kinda useless and misleading. Did this get peer reviewed or something?
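For reference, a two-line count shows the string from the request above is in fact unbalanced, so the zero-reasoning "Yes" is wrong (a quick sanity check, not something from the paper):

```python
s = "((((())))))"          # the "input" string from the request above
print(s.count("("), s.count(")"))  # 5 opens vs 6 closes -> not balanced
```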
Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).