Hacker News

Unlimited OCR: One-shot long-horizon parsing

424 points by ingve

by robotswantdata

5 subcomments

Very interesting.
The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a hard cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
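To make the two paths above concrete, here is a minimal sketch of which positions a newly generated token could attend to, and how many cache entries that implies. It is based only on my reading described above, not on the paper's actual implementation. The function names are made up for illustration, the window of 128 is just the example number used above, and the 1000 reference tokens are an arbitrary assumption.

```python
def visible_positions(num_ref_tokens, num_generated, window=128):
    """Positions the next token may attend to (hypothetical helper).

    Positions 0..num_ref_tokens-1 are the document image (global reference).
    Later positions are the model's own output tokens.
    """
    # Global reference: every image token stays visible.
    reference = list(range(num_ref_tokens))
    # Local generation: only the last `window` output tokens stay visible.
    first_kept = max(0, num_generated - window)
    local = [num_ref_tokens + i for i in range(first_kept, num_generated)]
    return reference + local


def cached_entries(num_ref_tokens, num_generated, window=128):
    """Cache entries needed under this scheme; bounded by ref + window."""
    return num_ref_tokens + min(num_generated, window)


if __name__ == "__main__":
    ref, win = 1000, 128  # assumed sizes, for illustration only
    for generated in (100, 10_000, 100_000):
        full = ref + generated  # full attention keeps everything: O(N)
        print(generated, full, cached_entries(ref, generated, win))
```

With full attention the cache keeps growing with every output token. In this sketch it stops growing once the output is longer than the window, which is the whole point as I understand it.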

by peatmoss

9 subcomments

I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. And the phone camera scans I made are okay, but fixed in size and have a lot of artifacts. And it would be great to transpose on the fly for e.g. Bb or Eb instruments, but being a scan this is obviously not possible.
I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).
I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest thing to a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. LilyPond format is less verbose than MusicXML and contains a bit more information that is useful to score creators, but most people don't create sheet music in LilyPond. (As an aside, LilyPond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.)
I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.

by KitN

1 subcomments

"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."
Class Act.

by novoreorx

0 subcomment

FYI, "Unlimited OCR Works" is a Fate/stay night reference. The original "Unlimited Blade Works" is a magic whose entire premise is copying weapons other people forged.

by janpeuker

0 subcomment

Paper under https://arxiv.org/abs/2606.23050
(As a side note, I do OCR locally as a small RAG for citations I read in books and also chunk the input, but merely to save RAM. Interesting that this natural approach also works in a streaming model.)

by lacoolj

0 subcomment

This looks more promising than what Mistral just launched (coincidence?????? i think not.)
This approach feels like it could be used for image gen as well (in some combination). Read/view the image, start drawing it using Illustrator/Inkscape/etc. (or just SVG), then fill in what was missed afterwards.

by aliljet

0 subcomment

How does this compare with Infinity Parser 2, which seemed to be running the table on every other OCR tool (https://huggingface.co/datasets/allenai/olmOCR-bench)? To be fair, there's no single winning OCR benchmark and this isn't showing up anywhere yet.

by arboles

1 subcomments

I'm going to sound like I live under a rock, but what is the true reason companies open-source genuinely good software?
Shouldn't Baidu (or Google) hoard it for themselves to extract the value in a way the competition isn't able to imitate?

by pmarreck

4 subcomments

My attempts at using AI to do OCR have always resulted in invented artifacts, which is not feasible for production. Does this suffer from that as well?
A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect.

by manipalite

0 subcomment

Whatever happened to Reducto? It was very promising 12-15 months ago.

by gettingoverit

0 subcomment

How does it compare against FineReader? Comparisons against transformer-based OCRs don't really tell us anything. The last time I checked, none of them were of "OCR this legal document" quality.

by overflowy

0 subcomment

What are the requirements for running this locally?

by piterrro

1 subcomments

Can someone explain how this is different from feeding the VLM one page at a time?

by alansaber

1 subcomments

We've invented chunking? We are so back.

by shevy-java

0 subcomment

Is this one of those academic papers that gets published in year xyz, but that nobody remembers 5 years later?

by ramon156

0 subcomment

I love that the entire goal is to push Deepseek OCR further. The West can learn a great deal from these companies.

by Oras

12 subcomments

OCR was solved a long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?
I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?