Hacker News

Why frontier LLMs can't read the hard documents without experts involved

25 points by chelm

by RugnirViking

A clearly LLM-written piece about how frontier models struggle to get past 76% accuracy on their benchmark (they call it a "wall") in OCR tasks, that is, feeding the model a picture of a document and asking it to extract the text.
The benchmark site is here: https://www.idp-leaderboard.org/
They say some specialist models get better results on their benchmark (Nanonets OCR-3: 85.9%).
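
For anyone wondering what an "accuracy" number like that could mean in practice, here is a minimal sketch of character-level scoring. It is an assumption for illustration only: the leaderboard's actual metric is not described in this thread, and `edit_distance` and `char_accuracy` are hypothetical helper names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(predicted: str, reference: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, reference) / len(reference))
```

Under this assumed metric, a model that misreads one character in a four-character field scores 0.75 on that field, so a single wrong digit in an amount still looks "mostly right" even though the extracted value is unusable.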

by chelm

tl;dr: years ago, Tesseract was the go-to tool for extracting text. Nowadays, vision LLMs can extract not only the text and the layout but also the context, and they can provide structured data or even interpret and map data across documents. Prices dropped significantly, while extraction, classification and modification capabilities increased.
The intelligent document processing market (a funny marketing term on top of OCR) is moving from "can software extract the text?", which is normally measured by benchmarks, to "can software autonomously run 'a' specific company process?"
The fallback is called human in the loop, hallucination (LSTM vs. vision LLM), prompt engineering.
Prove me wrong: the hardest challenge is no longer OCR accuracy but integration and issue handling in production. Probably "an agentic team can handle this" ^^
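
To make the human-in-the-loop fallback concrete, here is a minimal sketch of confidence-based routing. Everything in it is an assumption for illustration: `ExtractedField`, `route_fields`, the `confidence` score and the 0.9 threshold are hypothetical, and many extractors do not report a per-field confidence at all.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # hypothetical score in [0, 1] from the extractor


def route_fields(fields, threshold=0.9):
    """Split fields into auto-accepted ones and ones queued for human review."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review


fields = [
    ExtractedField("invoice_number", "INV-1042", 0.98),
    ExtractedField("total", "1,204.50", 0.71),
]
accepted, needs_review = route_fields(fields)
print([f.name for f in accepted])      # fields passed straight through
print([f.name for f in needs_review])  # fields sent to a person
```

The routing itself is trivial. The production work is in what surrounds it: where the review queue lives, who works it, and how corrections flow back into the process.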

by nullc

I mean, this is for handwritten OCR... do humans do better?
I've been using Qwen3.6 to OCR stuff, primarily receipts, and it frequently reads mangled/faded/folded documents accurately when I have a hard time with them myself... including handwritten stuff (though that's not flawless).

by madikz
