Hacker News

GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

570 points by oshrimpton

by wolttam

13 subcomments

> it is clear that actual intelligence has plateaued significantly.
> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse
These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.
Edit: My mention of data comes from this quote:
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.

by aesthesia

7 subcomments

Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.
I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro with twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28%, but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
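The conversion above can be checked with a few lines of Python. This is a minimal sketch, assuming (as the comment describes) that the published hallucination rate is measured only over questions the model did not answer correctly. The function name and dictionary layout are illustrative and not part of the benchmark.

```python
def absolute_hallucination_rate(accuracy: float, conditional_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumes `conditional_rate` is measured only over questions the model
    did not answer correctly, so it is scaled by the non-correct share.
    """
    return (1.0 - accuracy) * conditional_rate


# (accuracy, conditional hallucination rate) as quoted in the comment above.
quoted = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

results = {
    name: absolute_hallucination_rate(acc, rate)
    for name, (acc, rate) in quoted.items()
}

for name, value in results.items():
    print(f"{name}: {value:.0%}")
# prints "Opus 4.8: 19%" and "GLM-5.2: 21%"
```

Under this assumption the two models end up about two points apart in absolute terms, even though their conditional rates differ by eight points.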

by stalfie

8 subcomments

One thing I wonder about hallucinations: on the surface, they seem like an easy problem for RLVR to target. Since you're already generating enormous amounts of reasoning traces which are verified against correct answers, just allow "don't know" as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, promote the traces that led to the "don't know" answer as training data. Essentially, you teach the model that "I don't know" is a valid answer.
Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
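The selection rule described in this comment could be sketched roughly as follows. This is a toy illustration, not how any lab actually runs RLVR: the `Trace` type, the `ABSTAIN` marker, and the exact-match verifier are all hypothetical stand-ins.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"  # hypothetical abstention marker


@dataclass
class Trace:
    reasoning: str
    answer: str


def select_training_traces(traces: list[Trace], correct_answer: str) -> list[Trace]:
    """Pick which sampled traces to reinforce for one problem."""
    verified = [t for t in traces if t.answer == correct_answer]
    if verified:
        # At least one rollout reached the verified answer: reinforce those.
        return verified
    # No rollout was correct: reinforce the ones that abstained instead.
    return [t for t in traces if t.answer == ABSTAIN]


rollouts = [
    Trace("guessing from a similar problem", "5"),
    Trace("careful derivation", "4"),
    Trace("cannot pin this down", ABSTAIN),
]

solvable = select_training_traces(rollouts, correct_answer="4")
unsolved = select_training_traces(rollouts, correct_answer="7")
```

Here `solvable` keeps only the trace that answered "4", while `unsolved` keeps only the abstaining trace. The sketch leaves open what to do when a problem is solvable but an abstaining trace is also sampled, which may be part of why this is harder in practice.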

by solid_fuel

2 subcomments

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog - "Bigger models are not the way" is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever increasing scaling from these models.

by stcg

1 subcomments

In the referenced benchmark GLM-5.2 (max) got 25% of all questions correct. GPT-5.5 (xhigh) got 57% correct.
https://artificialanalysis.ai/evaluations/omniscience
I'd much rather have some answer that I can verify than no answer to verify.
I don't want a model that says "I don't know", because I will verify the answer anyway.

by taffydavid

2 subcomments

> For the non technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.
I'm already hallucinating about how this could work and it involves catapults

by frankohn

3 subcomments

I think hallucination rates are not a matter of model size but depend on how the model was trained. Models have been trained on a huge corpus of material made up overwhelmingly of well-formed questions and well-formulated, correct answers. This is typically the case with books, where the material is highly curated by experts in the field. In a book you never see a question that admits no answer, with the book reasoning through why and how the question has no answer. Nor will you see a good question followed by the book candidly explaining that it doesn't know the answer, because the way book material is curated, the author will simply omit questions they have no answers for.
In addition, I think that during RLHF the labs have a bias toward interesting questions that admit a solution and underrepresent the "bad" questions that admit no good answer. They probably also put less effort into RLHF on questions where the model should admit it doesn't know.
As humans we have been trained all our lives, in the real world, by being confronted with questions we can't answer right away, and we have learned to assess very quickly that we don't know or are not sure about the answer.
Another thing we have and LLMs do not is fear. We have an amygdala in our brain, separate from the logical thinking part, that can raise a fear signal so that we become much more careful about what we say. An LLM, on the other hand, has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.
So hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to build the most intelligent and capable model.
Alternatively, I can imagine creating a separate amygdala-like organ for an LLM. That organ could asynchronously fire signals, based on the user prompt and the LLM's thinking trace, to inject a fear signal into the LLM's reasoning so that it can steer its answer toward something safer.

by andai

1 subcomments

> GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies.
This implies that bigger models are more likely to hallucinate? That doesn't match my experience.

by hereme888

0 subcomment

Artificial Analysis says GPT-5.5 xhigh scores highest on AA-Omniscience accuracy. The article focuses on hallucination rate instead of overall accuracy. Those are different things: a model can answer more questions correctly overall while still being worse at abstaining when wrong.
Curiously, this post and article are the only submission and interaction the OP has made, and these claims support the product he's intending to release.

by xlii

2 subcomments

My anecdotal experience differs (though I maintain that LLM evaluations are highly subjective and benchmarks are just as useful for LLMs as they are for dating website users).
GLM 5.2 tends to stray way more than 5.1. It also hallucinates things subtly: it morphs requirements and draws unfounded conclusions. This is not output I have experienced in any model I have seen so far.
In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".

by nathan_compton

0 subcomment

Synthesizing a bunch of stuff I've read here lately, it seems like if OpenAI and Claude have actually found product-market fit (generating code), then the question of hallucination is going to get less attention in the future. If the real money is in code generation (where there is a relatively clear acceptance criterion of at least "it runs and does what I wanted as far as I can tell"), then there doesn't seem to be a lot of juice in pulling one's hair out over hallucination of facts.
It seems like for agentic coding, just making sure the AI can find the relevant documentation to establish a ground truth is probably sufficient.
Note that I'm distinguishing here between hallucination of what you might call "free facts" and hallucination of material which deviates from what is in the context itself. The latter seems both a tractable problem and one which will improve coding agent functionality. But the former seems like it's no longer on the critical path, probably because it's hard.

by hyperpape

0 subcomment

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
Such a weird thing to start with. The legal status of Fable does not mean that it's not intelligent. If anything, the problem is the opposite, someone thinks it's too intelligent (and/or that Anthropic wouldn't share its last gen intelligent models on the terms the government demanded).

by giancarlostoro

1 subcomments

I wonder if this is what a “Minimally Viable LLM” looks like. I often wonder how much of an LLM you need before you can just give it a bigger context window and feed it any dynamic knowledge content, like a PDF or markdown file, to supply knowledge outside of its training data. I feel like LLMs don’t need more data; they just need to be refined.

by aubanel

1 subcomments

> Bigger is not better
The article uses the example of GLM being smaller than DeepSeek, yet better on hallucinations as "smaller can be good too"
But the GLM family itself is scaling up fast: GLM-5.x family is 754B, double the previous generation of GLM-4.x
> comes within just 4 points of GPT-5.5 and 9 points of Fable 5
9 percentage points IS a big difference

by cwillu

0 subcomment

Please don't editorialize titles unless the original title is misleading.

by chazeon

0 subcomment

GPT-5.5 must have serious issues; it is fast, but quality-wise, it is just not good. It read one LaTeX paper (which is not long) and still managed to spell my name wrong. This is GPT-5.5-high.

by wiether

2 subcomments

Purely anecdotal, but when OpenAI removed Codex-5.3 from the ChatGPT sub and forced me to move to GPT-5.5, the result was far worse than what I was enjoying with Codex.
And, of course, it was burning 10 times more tokens for this output.

by EbNar

2 subcomments

The fact that a huge number of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 Flash should be less prone to these issues?

by nghnam

0 subcomment

I’d be careful about reading too much into these numbers. The test only looks at cases where the model doesn’t know the answer, so it doesn’t show how often users will actually see hallucinations.

by czk

0 subcomment

if you're benchmaxxing then maybe bigger doesn't always mean better, but for general intelligence and big model smell, that couldn't be further from the truth
the oss models are impressive but it's pretty clear how quickly they fall off when you try to use them outside of a narrow set of problems they benchmarked well on when compared to opus/5.5

by spwa4

6 subcomments

Why is everyone expecting LLMs to be like the Star Trek computer? I wonder if anyone's ever measured what the hallucination rate of a human is.

by raincole

0 subcomment

> meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer.
From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
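To make that degenerate case concrete, here is a small sketch with made-up outcome counts. It assumes the metric is wrong answers divided by all non-correct responses, which is how the quoted passage describes it. The outcome labels and both toy models are hypothetical.

```python
def hallucination_rate(outcomes: list[str]) -> float:
    """Wrong answers as a share of all non-correct responses.

    Each outcome is "correct", "wrong", or "abstain" (labels are assumptions).
    """
    wrong = outcomes.count("wrong")
    abstain = outcomes.count("abstain")
    not_correct = wrong + abstain
    return wrong / not_correct if not_correct else 0.0


def accuracy(outcomes: list[str]) -> float:
    """Correct answers as a share of all responses."""
    return outcomes.count("correct") / len(outcomes)


# Toy model that refuses every question.
always_abstains = ["abstain"] * 100
# Toy model that always answers and is right 60% of the time.
answers_everything = ["correct"] * 60 + ["wrong"] * 40

for name, outcomes in [
    ("always abstains", always_abstains),
    ("answers everything", answers_everything),
]:
    print(f"{name}: hallucination rate {hallucination_rate(outcomes):.0%}, "
          f"accuracy {accuracy(outcomes):.0%}")
```

The refusing model scores a perfect 0% hallucination rate with 0% accuracy, while the model that always answers scores 100% on the same metric despite being right most of the time, so the rate only means something when read alongside accuracy.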

by orbital-decay

0 subcomment

DS v4 is an undertrained snapshot, which is mentioned in their model card. The full version is supposed to be released later and have multimodal input. That said, hallucination rate likely depends on the training policy and different optimization tradeoffs a lot more than on the scale.

by gcanyon

1 subcomments

> it is clear that actual intelligence has plateaued significantly
N=1, but I disagree strongly. I'm writing a hard-science science fiction story, and the physics of it is at (and frankly, beyond) my skillset. The story's plot has had to change over a dozen times as I realized errors in my application of physics in the story.
Throughout, I've been reviewing the physics with LLMs, mainly Gemini 3.1 Pro Preview, but also with Claude and OpenAI. Often I have the LLMs debate each other -- "My friend [another model] said XYZ about the physics, is that right or wrong?" In almost all cases, Gemini explains why the other models are wrong, and when I send its explanation to them, they concede it is right and they are wrong.
As I said, I did the above checks literally dozens of times as I wrote the story. And everything was dialed in: no further issues claimed by anyone, me or the LLMs.
Not with Fable. I managed to get it to review the story while it was running, and it listed out something like ten issues: some minor, some general knowledge-based, and two that were impressive:
1. It pointed out where Gemini (and I, and other LLMs) had missed a , resulting in values about 152 times larger than they should have been. I sent that to Gemini and it fully conceded that it had been wrong all along.
2. It pointed out a simple inconsistency in the application of special relativity (I thought I had that at least dialed in, but no :-/ ) that affected a very specific plot point. The story is novella-length, about 28,000 words long, and this is a point that was mentioned in the first two pages, and then not again until the very last page. And it's obvious, once you realize it. And I missed it. Gemini missed it. Claude and ChatGPT missed it.
Only Fable found it. Again, N=1, but that was a remarkable run I got out of it in the couple days it was available.

by stevenhubertron

0 subcomment

The more I have been using 5.2 the more I have been impressed with it. And I’ve just been using the usually neutered ollama version.

by zuzuen_1

0 subcomment

I think we need a better classification and taxonomy of erroneous LLM behaviors than the catch-all term "hallucinate".

by nextaccountic

1 subcomments

>GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.
What about using two models, with a smaller model used for this kind of negative reasoning?

by metalspot

2 subcomments

hallucination is good for tasks that have an external oracle like computer programming

by anArbitraryOne

0 subcomment

It's fine if it hallucinates, as long as it sounds overconfident

by brown_munda

0 subcomment

GLM 5.2 is really impressive at design as well. Overall loving it.

by gitaarik

0 subcomment

This reminds me of the Missing Dollar Riddle [1], where the listener is deliberately put on a wrong thinking path in order to fool them.
With your own logical thinking you might never come to this confusion, and if you never heard this riddle before, you might be tricked by it.
But as we grow in life, and get experience, we learn about these riddles and aren't fooled as easily anymore.
Maybe it'll work like that for LLMs too?
[1]: https://en.wikipedia.org/wiki/Missing_dollar_riddle

by ecommerceguy

0 subcomment

It's very much looking like OpenAI will be bailed out, along with all the other Capex'ers. I say this because the Trump admin (I feel partially at fault because I voted for him) has indicated they will be bailing out the entire AI Stargate, from Intel and AMD to Amazon and Anthropic. I know a lot of everyday folks that absolutely hate - passionately HATE - anything and everything tech bro. Downvote all you want, that's the reality. They see Palantir et al as evil and demonic.

by dgellow

0 subcomment

> One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
We really don't know what the actual reason is given the politics at play. I would bet more on the Trump administration looking for any excuse to punish Anthropic

by remix2000

1 subcomments

Calling LLM slop "hallucinating" is so counter-productive imo. After all, LLMs are just a variant of Markov chains and as such this technology isn't able to discern falsehoods from truths. It's like trying to use a barometer to tell the time.

by Naveja

0 subcomment

loving glm 5.2 personally

by metalman

0 subcomment

to paraphrase the title, "in the land of the insane, those who are merely delusional will rule"

by abracadobre

0 subcomment

This is where I asked GPT 5.5
"they say u hallucinate 3x more than GLM 5.2, whats your comeback to this? do i need to dump u? $article"