Hacker News

Show HN: A new benchmark for testing LLMs for deterministic outputs

58 points by khurdula

by jumploops

0 subcomment

My experience here is anecdotal, but I've had more success solving the task first and then returning the result as JSON in a separate LLM call[0].
Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.
You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing it works a lot of the time, but reasonable to assume it won't all of the time.
(As a human, when I'm filling out a complex form, I'll often jump around the document.)
Curious how the benchmarks change when you add an intermediate representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].
[0] In our case, we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). When we first prompted the model for the code and then wrapped it in JSON, we saw much higher quality/success than when asking for JSON straightaway.
[1] https://boundaryml.com/
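A minimal sketch of the solve-first, format-second flow described above. The `llm` callable (prompt in, text out), the `SOLVE`/`FORMAT_AS_JSON` prompt prefixes, and `fake_llm` are all hypothetical stand-ins so the example runs offline; none of them come from the benchmark or from any real provider API.

```python
import json
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


def solve_then_format(task: str, llm: LLM) -> dict:
    """Two-pass flow: free-form answer first, JSON wrapping second."""
    # Pass 1: let the model solve the task with no format constraints.
    draft = llm(f"SOLVE\n{task}")
    # Pass 2: a separate call whose only job is to wrap the draft in JSON.
    raw = llm(f"FORMAT_AS_JSON\n{draft}")
    # Parsing can still fail; json.loads raises so the caller can retry.
    return json.loads(raw)


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model (assumption, for illustration)."""
    mode, _, body = prompt.partition("\n")
    if mode == "SOLVE":
        return "print('hello')"
    # Formatting pass: echo the draft back inside a JSON object.
    return json.dumps({"code": body})


result = solve_then_format("Write a Python hello world.", fake_llm)
print(result)
```

With a real model you would swap `fake_llm` for your own client wrapper; the point is only that the second call never has to reason about the task itself.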

by stared

2 subcomments

Thank you for sharing the benchmark. However, the results are selective.
Why no Opus 4.7? Why is Gemini 3.1 Pro missing?
If there is some other criterion (e.g. models within a certain time or budget), great - just make it explicit.
When I see "Top 5 at a glance" and it is missing key frontier models, I am (at best) confused.

by ossianericson

0 subcomment

Even when the JSON pass rate is at 97%, the real challenge is that the accuracy gap is invisible at the record level. Nothing flags it without a baseline to check against. In my experience, parse errors are rarely where it goes wrong. 'Valid' but incorrect data is what actually reaches production.
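To illustrate the gap between parse rate and accuracy, here is a small sketch that scores model outputs against a ground-truth baseline. The `score` helper and the sample records are made up for illustration and are not taken from the benchmark.

```python
import json


def score(outputs: list[str], expected: list[dict]) -> tuple[float, float]:
    """Return (parse_rate, accuracy) for raw outputs vs. a known baseline."""
    parsed_ok = 0
    correct = 0
    for raw, truth in zip(outputs, expected):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counted as neither parsed nor correct
        parsed_ok += 1
        # Valid JSON only counts as correct if it matches the baseline.
        if record == truth:
            correct += 1
    n = len(expected)
    return parsed_ok / n, correct / n


# Illustrative data: the second record parses fine but carries a wrong value.
outputs = ['{"total": 42}', '{"total": 24}', '{"total": ']
expected = [{"total": 42}, {"total": 42}, {"total": 7}]

parse_rate, accuracy = score(outputs, expected)
print(f"parse rate: {parse_rate:.2f}, accuracy: {accuracy:.2f}")
```

Without the `expected` list, the second record would pass every parse or schema check, which is the failure mode described above.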

by zihotki

1 subcomments

I wonder if this benchmark brings any value. Models are already quite capable and reach high scores on it.

by timxtokyo

1 subcomments

Would it be possible to add LLM providers for glm5.1 and minimax2.1? The parameters of those latest models have changed significantly compared to the previous gen.

by jadbox

1 subcomments

Wow, Qwen3.5-35B is absolutely punching above its weight. Perhaps it's the best/cheapest model for just JSON operations?

by broyojo

1 subcomments

hmm why can't structured decoding be used?

by maxdo

2 subcomments

GPT 5.5 seems to be the recent leader overall. It makes sense to include it, just to see what you trade off for speed and open-source nature vs. the cutting-edge leader.

by iLoveOncall

3 subcomments

This is just a hallucination benchmark on a subset of outputs; I'm not sure there's any value over general hallucination benchmarks.
> Our goal is to be the best general model for deterministic tasks
I'm sorry, but this simply doesn't make sense. If you want a deterministic output, don't use an LLM.

by dalberto

2 subcomments

A benchmark without Opus 4.6/4.7 feels incomplete.