Hacker News

GLM-OCR: Accurate × Fast × Comprehensive

202 points by ms7892

by coder543

5 subcomments

by alaanor

2 subcomments

There have been so many OCR models released in the past few months, all VLMs, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document) they just fail at a "simple" task. And funnily enough, Qwen3 8B VL is the model that most often gets it right (although I couldn't get the bboxes to come out well). Even funnier, whatever is running locally on an iPhone on CPU is insanely good, same with Google's OCR API. I don't know why we don't get more of the traditional OCR stuff. Paddlepaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.

by aliljet

7 subcomments

This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned with monstrously poor resolution, wet signed, all kinds of shit. The big LLM providers choke on this raw input and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...
And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.

by mikae1

0 subcomment

Text me back when there's a working PDF to EPUB conversion tool. I've been waiting (and searching for one) long enough. :D
EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.

by surfacedamage

0 subcomment

This might be a niche question, but does glm-ocr (or other libraries) have the ability to extract/interpret QR code data?

by ks2048

0 subcomment

I've been trying different OCR models on what should be very simple - subtitles (these are simple machine-rendered text). While all models do very well (95+% accuracy), I haven't seen a model not occasionally make very obvious mistakes. Maybe it will take a different approach to get the last 1%...

by ThrowawayTestr

0 subcomment

What's the current SOTA for Japanese and Korean OCR? BalloonsTranslator has a great workflow but the models are pretty old.

by rdos

2 subcomments

Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not reflecting reality? I would love to be hopeful, but so far an open source model has never been better than a closed one, even when benchmarks showed that it was.

by sinandrei

1 subcomments

Has anyone experimented with using a VLM to detect "marks"? Thinking of pen/pencil-based markings like underlines, circles, checkmarks... Can these models do it?

by bugglebeetle

2 subcomments

I tested this pretty extensively and it has a common failure mode that prevents me from using it: it drops footnotes and similar sections when extracting the full text of academic works. For some reason, many of these models are trained in a way that results in these being excluded, despite these document sections often containing important details and context. Both versions of DeepseekOCR have the same problem. Of the others I’ve tested, dot-ocr in layout mode works best (but is slow), followed by datalab’s chandra model (which is larger and has bad license constraints).

by raphaelmolly8

0 subcomment