FRESH

Hacker News

Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API

87 points by adammajcher

by sync

3 subcomments

This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...

by prats226

0 subcomment

Instead of markdown -> LLM to get JSON, you can just train a slightly bigger model which you can constrain decode to give JSON rightaway. https://huggingface.co/nanonets/Nanonets-OCR2-3B
We recently published a cookbook for constrained decoding here: https://nanonets.com/cookbooks/structured-llm-outputs/

by binalpatel

1 subcomments

This is admittedly dated but even back in December 2023 GPT-4 with it's Vision preview was able to very reliably do structured extraction, and I'd imagine Gemini 3 Flash is much better than back then.
https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
Back of the napkin math (which I could be messing up completely) but I think you could process a 100 page PDF for ~$0.50 or less using Gemini 3 Flash?
>560 input tokens per page * 100 pages = 56000 tokens = $0.028 input ($0.5/m input tokens) >~1000 output tokens per page * 100 pages = $0.30 output ($3/m output tokens)
(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)

by v3ss0n

1 subcomments

How this is better over Surya/Marker or kreuzberg https://github.com/kreuzberg-dev/kreuzberg.

by hersko

4 subcomments

I have a flow where i extract text from a pdf with pdf-parse and then feed that to an ai for data extraction. If that fails i convert it to a png and send the image for data extraction. This works very well and would presumably be far cheaper as i'm generally sending text to the model instead of relying on images. Isn't just sending the images for ocr significantly more expensive?

by sgc

1 subcomments

How does this compare to dots.ocr? I got fantastic results when I tested dots.
https://github.com/rednote-hilab/dots.ocr

by fmirkowski

0 subcomment

having worked with paddleocr, tesseract and many other ocr tools before this is still one of the best and smoothest ocr experiences ive ever had, deployed in minutes

by constantinum

0 subcomment

by mechazawa

1 subcomments

by cess11

1 subcomments

Why is 12GB+ VRAM a requirement? The OCR model looks kind of small, https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main, so I'm assuming it is some processing afterwards it would be used for.