FRESH

Hacker News

Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction

39 points by sidmanchkanti21

by dang

2 subcomments

> happy to run additional documents if people want to share examples
I've got one! The pdf of this out-of-print book is terrible: https://archive.org/details/oneononeconversa0000simo. The text is unreadably faint, and the underlying text layer is full of errors, so copy-paste is almost useless. Can your software extract usable text?
(I'll email you a copy of the pdf for convenience since the internet archive's copy is behind their notorious lending wall)

by bambax

0 subcomment

by think4coffee

1 subcomments

Congrats on the launch! You mention that you're SOTA on benchmarks. Can you share your research, or share which benchmark you used?

by lajr

1 subcomments

Hey, congratulations on the launch. Just noticed a discrepancy in the financial 10K example:
There is a section near the start where there are 4 options: Large accelerated filer, Non-accelerated filer, Accelerated filer, or Smaller reporting company.
In this option, "Large accelerated filer" is checked on the PDF, but "Non-accelerated filer" is checked on the Markdown.

by Ishirv

0 subcomment

Super interesting stuff. I’m a fan - been a pulse customer for a while. However, I’ve found it has trouble with things that need intelligence like quotes meaning to repeat the previous line. Is that something you’re working on or is that not the right use case for pulse?

by scottydelta

1 subcomments

AI models will eventually do this natively. This is one of the ways for models to continue to get better, by doing better OCR and by doing better context extraction.
I am already seeing this trend in the recent releases of the native models (such as Opus 4.5, Gemini 3, and especially Gemini 3 flash).
It's only going to get better from here.
Another thing to note is, there are over 5 startups right now in YC portfolio doing the same thing and going after a similar/overlapping target market if I remember correctly.

by aryan1silver

1 subcomments

looks really cool, congrats on the launch! are you guys using something similar to docling[https://github.com/docling-project/docling]?

by TZubiri

0 subcomment

How does it handle tables with invisible lines and inconsistent justification? (For example one centered column and one right justified column.

by throw03172019

3 subcomments

Congrats on launch! We have been using this for a new feature we are building in our SaaS app. It’s results were better than Datalab from our tests, especially in the handwriting category.

by DIVx0

0 subcomment

can't sign up with gmail or "personal" email addresses? What if I want to evaluate but I am not ready to inundated with sales calls? My 'work' email domain is one that many vendors would love to see in their CRM. I always sign up with disposables first.
I guess I should thank you for saving my time? Plenty of others in this space.

by sidcool

0 subcomment

by canadiantim

0 subcomment

Can you increase correctness by giving examples to the model? And key terms or nouns expected?

by mikert89

3 subcomments

by asdev

1 subcomments