- This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...
- Instead of going markdown -> LLM to get JSON, you can just train a slightly bigger model and use constrained decoding to get JSON right away (rough sketch below).
https://huggingface.co/nanonets/Nanonets-OCR2-3B
We recently published a cookbook for constrained decoding here:
https://nanonets.com/cookbooks/structured-llm-outputs/
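A minimal sketch of the constrained-decoding idea described above, assuming the model is served behind an OpenAI-compatible endpoint that supports JSON-schema constrained decoding (e.g. vLLM's OpenAI-compatible server). The endpoint URL, schema, and field names are illustrative assumptions, not taken from the Nanonets cookbook:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

// Assumed local endpoint serving Nanonets-OCR2-3B (e.g. via vLLM); URL and
// schema below are assumptions for illustration only.
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });

const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: { description: { type: "string" }, amount: { type: "number" } },
        required: ["description", "amount"],
      },
    },
  },
  required: ["invoice_number", "total", "line_items"],
};

const pageB64 = readFileSync("page-1.png").toString("base64");

const completion = await client.chat.completions.create({
  model: "nanonets/Nanonets-OCR2-3B",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the invoice fields from this page." },
        { type: "image_url", image_url: { url: `data:image/png;base64,${pageB64}` } },
      ],
    },
  ],
  // Constrained decoding: the server only emits tokens that keep the output
  // valid against this schema, so there is no markdown -> LLM -> JSON round trip.
  response_format: {
    type: "json_schema",
    json_schema: { name: "invoice", schema: invoiceSchema, strict: true },
  },
});

console.log(JSON.parse(completion.choices[0].message.content ?? "{}"));
```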
by binalpatel
1 subcomment
- This is admittedly dated, but even back in December 2023 GPT-4 with its Vision preview was able to do structured extraction very reliably, and I'd imagine Gemini 3 Flash is much better now than it was back then.
https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
Back-of-the-napkin math (which I could be messing up completely), but I think you could process a 100-page PDF for ~$0.50 or less using Gemini 3 Flash? (Sketch of such a call after this comment.)
>560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)
>~1000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)
(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
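For reference, a rough sketch of that kind of per-page structured extraction against the Gemini API using the @google/genai SDK. The model id and schema are placeholders (assumptions); the cost figures come only from the napkin math quoted above:

```typescript
import { GoogleGenAI, Type } from "@google/genai";
import { readFileSync } from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// One rendered PDF page; at the quoted rates (~560 input + ~1000 output tokens
// per page), 100 pages works out to roughly $0.028 + $0.30 ≈ $0.33.
const page = readFileSync("page-1.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash", // placeholder; substitute whatever Flash model you use
  contents: [
    {
      role: "user",
      parts: [
        { inlineData: { mimeType: "image/png", data: page } },
        { text: "Extract the invoice number, date, and total from this page." },
      ],
    },
  ],
  config: {
    responseMimeType: "application/json",
    responseSchema: {
      type: Type.OBJECT,
      properties: {
        invoiceNumber: { type: Type.STRING },
        date: { type: Type.STRING },
        total: { type: Type.NUMBER },
      },
      required: ["invoiceNumber", "date", "total"],
    },
  },
});

console.log(JSON.parse(response.text ?? "{}"));
```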
- How is this better than Surya/Marker or kreuzberg? https://github.com/kreuzberg-dev/kreuzberg
- I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and is presumably far cheaper, since I'm generally sending text to the model instead of images. Isn't just sending the images for OCR significantly more expensive?
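A rough sketch of that fallback flow, assuming Node/TypeScript with pdf-parse for the text layer and poppler's pdftoppm for rasterization; the two extractor callbacks stand in for whatever model calls you use and are hypothetical:

```typescript
import { readFileSync } from "node:fs";
import { execFileSync } from "node:child_process";
import pdf from "pdf-parse";

// Hypothetical model-call wrappers: text -> JSON and image path -> JSON.
type Extractor = (input: string) => Promise<Record<string, unknown>>;

async function extractFromPdf(
  path: string,
  extractFromText: Extractor,
  extractFromImage: Extractor,
): Promise<Record<string, unknown>> {
  // First attempt: cheap text-layer extraction via pdf-parse.
  const { text } = await pdf(readFileSync(path));
  if (text.trim().length > 0) {
    try {
      return await extractFromText(text);
    } catch {
      // Fall through to the image path below.
    }
  }

  // Fallback: rasterize the first page with poppler's pdftoppm and send the image.
  execFileSync("pdftoppm", ["-png", "-r", "150", "-singlefile", path, "/tmp/page"]);
  return extractFromImage("/tmp/page.png");
}
```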
- How does this compare to dots.ocr? I got fantastic results when I tested dots.
https://github.com/rednote-hilab/dots.ocr
by fmirkowski
0 subcomments
- Having worked with PaddleOCR, Tesseract, and many other OCR tools before, this is still one of the best and smoothest OCR experiences I've ever had; deployed in minutes.
by constantinum
0 subcomments
- What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.
Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.
Other tools in this space:
LLMWhisperer/Unstract (AGPL)
Reducto
Extend AI
LlamaParse
Docling
by mechazawa
1 subcomment
- Is only Bun supported, or regular Node as well?
- Why is 12GB+ VRAM a requirement? The OCR model looks kind of small (https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main), so I'm assuming the extra VRAM is for some processing that happens afterwards.