but the real pain is always in the second and third batch. when formats change subtly. if reducto becomes the system that adapts without you babysitting it, that's where it may win. continuity's the moat imo among the competitors
From the typography and layout to the line-work, down to how the gradients in the large, fashionable logotype at the bottom of the footer are tied in using texture.
Was it in house, or an agency? I'd love to see some more of whoever's work it was
Do you store the uploaded doc from free/test account?
I am the founder of http://DocRouter.AI, https://github.com/analytiq-hub/doc-router. Available online as http://app.docrouter.ai (no paywall, working on Stripe integration).
Pre-seed stage, looking for collaborators and funding.
Ours is open source. Think of us as an ERP for documents, LLM prompts, and extraction schemas. We run on top of litellm, as a portability layer, so we support all major LLM models.
Extraction schemas can be configured through a drag-and-drop UI, or inline by editing JSON.
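As a purely hypothetical illustration (these field names are mine, not DocRouter's actual schema format), an inline-edited extraction schema might look roughly like JSON Schema:

```json
{
  "name": "invoice_extraction",
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "invoice_date":   { "type": "string", "format": "date" },
    "total_amount":   { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount":      { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "total_amount"]
}
```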
A tagging mechanism determines which prompts run on which documents - so we don't run every prompt against every document, which would scale quadratically (O(prompts × documents)).
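A minimal sketch of what tag-based routing looks like (my own toy code, not DocRouter's implementation): each prompt and document carries tags, and a prompt runs only on documents that share at least one of its tags.

```python
# Toy sketch of tag-based prompt routing (illustrative, not DocRouter's
# actual code): a prompt runs only on documents sharing one of its tags,
# avoiding the quadratic "every prompt x every document" blowup.

def route(prompts, documents):
    """Return {prompt_name: [doc_names]} for tag-matched pairs."""
    runs = {}
    for prompt in prompts:
        matched = [
            doc["name"]
            for doc in documents
            if set(prompt["tags"]) & set(doc["tags"])
        ]
        if matched:
            runs[prompt["name"]] = matched
    return runs

prompts = [
    {"name": "invoice_extractor", "tags": ["invoice"]},
    {"name": "quiz_rubric_7th_grade", "tags": ["quiz-7th"]},
]
documents = [
    {"name": "acme_invoice.pdf", "tags": ["invoice"]},
    {"name": "quiz_alice.pdf", "tags": ["quiz-7th"]},
]

print(route(prompts, documents))
# {'invoice_extractor': ['acme_invoice.pdf'], 'quiz_rubric_7th_grade': ['quiz_alice.pdf']}
```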
APIs are available for all functions (upload docs, configure prompts & schemas, download results).
We are designed for human-in-the-loop workflows, where precise processing of financial, insurance, or medical data is essential.
We see two main use cases, right now:
1 - Accelerating AI adoption in other engineering organizations that don't have time to build AI pipelines in-house. In this use case, we can quickly develop a specialized UI for you (Lovable or Bolt, plus adapting the generated UI with Cursor for your use case). In this play, we are a data-layer accelerator for your AI solution.
2 - Solving point problems in document processing in insurance, medical, biotech, revenue cycle management, supply chain... In this use case, the business pain point we solve is manual processing of documents in an ERP that may not have the latest AI features. DocRouter.AI sits inline, in front of the ERP, picking selected faxes, emails, docs - processing them with LLMs, and inserting structured data into your ERP, saving on human labor.
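That inline flow can be sketched roughly as follows. All function bodies here are placeholder stubs I wrote for illustration, not DocRouter's real code or API:

```python
# Conceptual sketch of the "inline, in front of the ERP" flow.
# Every function is a placeholder stub, not DocRouter's actual code.

def ocr(doc_bytes: bytes) -> str:
    # Placeholder: a real system would call an OCR engine here.
    return doc_bytes.decode("utf-8", errors="ignore")

def extract_fields(text: str) -> dict:
    # Placeholder: a real system would run an LLM with an extraction schema.
    return {"raw_text": text, "confidence": 0.95}

def needs_review(fields: dict) -> bool:
    # Route low-confidence extractions to a human reviewer.
    return fields.get("confidence", 0.0) < 0.8

def human_review(fields: dict) -> dict:
    # Placeholder for the human-in-the-loop correction step.
    return fields

def erp_insert(fields: dict) -> str:
    # Placeholder: a real system would call the ERP's API here.
    return "inserted"

def process_inbound(doc_bytes: bytes) -> str:
    """Fax/email/PDF in -> structured data into the ERP."""
    text = ocr(doc_bytes)
    fields = extract_fields(text)
    if needs_review(fields):
        fields = human_review(fields)
    return erp_insert(fields)
```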
The 2nd use case is something we see again and again in the industry. Legacy ERP vendors are slow to adopt AI processing, and businesses sitting on top of an ERP find it prohibitive to switch ERPs. These businesses are nickel and dimed over any small new ERP feature (...want to support PDFs not just TIFFs? that's thousands of dollars!... want to call APIs into the ERP? that's charged per API call!...)
They desperately need AI solutions for these business workflows, to free up FTEs for more interesting work.
Here is a 30-minute recorded talk from a Mindstone meetup: https://community.mindstone.com/annotate/article_AuDOhLA5awW... where I showed how DocRouter.AI can be used to grade middle school quizzes with AI, with a teacher in the loop. This was a "1st use case" application, with a custom UI specialized to the application.
For the grade-school-quizzes-with-AI application, we generated the quiz rubric synthetically with AI, as we did the student quizzes. The rubric is embedded in the LLM prompt. The quiz PDF is tagged with the same tag as the corresponding rubric prompt (so it's graded with the corresponding rubric).
This idea of matching a quiz against a quiz rubric comes up again and again in many other examples. The same mechanism can be used to:
- Match invoices with purchase orders
- Or, to verify invoices against allowed amounts in a contract.
- Or, to check if standard operating procedures for transportation security comply with government or insurance rules.
- Or, to check if medical documents comply with a set of insurance rules. This is a use case I developed over a year and a half in the Durable Medical Equipment space, as consulting work (and it inspired the design of the DocRouter as a more general solution).
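The common pattern in all of these is "extract, then join two document types on a shared key and check a rule." A toy sketch of the invoice/purchase-order case (field names and amounts are invented for illustration):

```python
# Toy sketch of the invoice-vs-purchase-order matching pattern
# (illustrative only; field names are invented). After extraction,
# documents are joined on a shared key and checked against a rule.

def match_invoices_to_pos(invoices, purchase_orders):
    """Pair each invoice with its PO by po_number and flag overbilling."""
    pos_by_number = {po["po_number"]: po for po in purchase_orders}
    results = []
    for inv in invoices:
        po = pos_by_number.get(inv["po_number"])
        if po is None:
            results.append((inv["po_number"], "no matching PO"))
        elif inv["amount"] > po["approved_amount"]:
            results.append((inv["po_number"], "exceeds approved amount"))
        else:
            results.append((inv["po_number"], "ok"))
    return results

invoices = [
    {"po_number": "PO-100", "amount": 950.0},
    {"po_number": "PO-101", "amount": 1200.0},
]
purchase_orders = [
    {"po_number": "PO-100", "approved_amount": 1000.0},
    {"po_number": "PO-101", "approved_amount": 1000.0},
]

print(match_invoices_to_pos(invoices, purchase_orders))
# [('PO-100', 'ok'), ('PO-101', 'exceeds approved amount')]
```

Swapping in a rubric and a quiz, or a contract and an invoice, changes only the extraction schemas and the rule, not the mechanism.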
The idea of a system that just keeps track of prompts, extraction schemas, and documents - while very simple - can solve many problems across different verticals.
In fact, I believe that, when multiple products can solve the same problem, it is the simplest product that has the best chance to succeed.
So, a lot of thinking goes into keeping the design simple and the APIs complete, and into removing unnecessary artifacts. If new features are needed, they can be added as external blocks, so the central function of the DocRouter does not become cluttered.
Here are tech slides from my Boston PyData presentation, where I showed how DocRouter.AI was implemented, using React, NextJS, FastAPI, and with a MongoDB back end: https://docs.google.com/presentation/d/14nAjSmZA1WGViqSk5IZu...
(I did not know how to program React before this... but in the brave new world of Cursor and Windsurf editors, I can venture into bold new directions!)
Ping me if you are interested to collaborate, or just if you are interested in the space!
Our thesis is that the space is large enough, and there's a market for multiple players. We specialize in business workflows with human-in-the-loop, and we offer consulting services for project integration / turnkey delivery.
Andrei Radulescu-Banu, andrei@analytiqhub.com
YC seems to fund quite a few document extraction companies, even within the same batch:
- Pulse (YC W24): https://www.ycombinator.com/companies/pulse-3
- OmniAI (YC W24): https://www.ycombinator.com/companies/omniai
- Extend (YC W23): https://www.ycombinator.com/companies/extend
How do you differentiate from these? And how do you see the space evolving as LLMs commoditize PDF extraction?
https://github.com/mathpix/mpxpy
Disclaimer: I'm the founder. Reducto does cool stuff on post processing (and other input formats), but some people have told me Mathpix is better at just getting data out of PDFs accurately.
Just want to say how energizing it is to see this space maturing through thoughtful products like Extend and Reducto. Congrats to both for your Series A. I’d also mention GetOmni, as they’re doing great work leading the open-source front with their ZeroX project. We’ve learned a lot by observing your execution, and frankly, anyone serious about document intelligence tracks this ecosystem closely. It’s been encouraging to see ideas we were exploring early last year reflected in your recent successes. No shame there; good ideas often converge over time.
When we started fundraising (prior to GPT-4o), few investors believed LLMs would meaningfully disrupt this space. Finding the right supporters meant enduring a lot of rejection and delayed us quite a bit. Raising is always hard, especially in Spain, where even a modest €500K pre-seed round typically requires proven MRR on the order of €10K.
We’re earlier-stage, but strongly aligned in product philosophy, especially in the belief that the challenge isn’t just parsing PDFs. It’s building a feedback loop so fast and intuitive that deploying new workflows feels like development, not consulting. That’s what enables no-code teams to actually own automation.
From our experience in Europe, the market feels slower. Legacy tools like Textract still hold surprising inertia, and even €0.04/page can trigger pushback, signaling deeper friction tied to organizational change. Curious if US-based teams see the same, or whether pricing and adoption are more elastic. We’ve also heard “we’ll build this internally in 3 weeks” more times than we can count—usually underestimating what it takes to scale AI-based workflows reliably.
One experiment we’re excited about is using AI agents to ease the “blank page” problem in workflow design. You type: “Given a document, split it into subdocuments (contract, ID, bank account proof), extract key fields, and export everything into Excel.” The agent drafts the initial pipeline automatically. It helps DocOps teams skip the fiddly config and get straight to value. Again, no magic—just about removing friction and surfacing intent.
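As a toy illustration of that idea (my own sketch, not their agent): the agent's job is essentially to map a natural-language request to a draft pipeline config that the DocOps team then refines. A real agent would use an LLM; this stand-in uses keyword matching just to show the shape of the output.

```python
# Toy sketch of drafting a pipeline from a natural-language request
# (illustrative keyword matching only; a real agent would use an LLM).

def draft_pipeline(request: str) -> list:
    """Return a draft list of pipeline steps inferred from the request."""
    steps = []
    text = request.lower()
    if "split" in text:
        steps.append({"step": "split",
                      "into": ["contract", "ID", "bank account proof"]})
    if "extract" in text:
        steps.append({"step": "extract", "fields": "key_fields"})
    if "excel" in text:
        steps.append({"step": "export", "format": "xlsx"})
    return steps

request = ("Given a document, split it into subdocuments (contract, ID, "
           "bank account proof), extract key fields, and export everything "
           "into Excel.")
print(draft_pipeline(request))
```

The point is not the matching logic but the contract: free-form intent in, an editable pipeline draft out, so the team starts from a working skeleton instead of a blank page.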
Some broader observations that align with what others here have said:
- Parsing/extraction isn’t a long-term moat. Foundation models keep improving and are beginning to yield bounding boxes. Not perfect yet, but close.
- Moats come from orchestration-first strategies and self-adaptive systems: rapid iteration, versioning, observability, and agent-assisted configuration using visual tools like ReactFlow or Langflow. Basically, making life easier for the pipeline owner.
- Prompt-tuning (via DSPy, human feedback, QA) holds promise for adaptability but is still hard to expose through intuitive UX, especially for semi-technical DocOps users without ML knowledge.
- Extraction confidence remains a challenge. No method fully prevents hallucinations. We shared our mitigation approach here: http://bit.ly/3T5nB3h. OCR errors are a major contributor: we’ve seen extractions marked high-confidence despite poor OCR input. The extraction logic was right, but we failed to penalize for OCR confidence (we’re fixing that).
- Excel files are still a nightmare. We’re experimenting with methods like this one (https://arxiv.org/html/2407.09025v1), but large, messy files (90+ tabs, 100K+ rows) still break most approaches.
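The OCR-confidence point can be sketched simply (my own illustration of the idea, not the author's actual fix): combine the extraction confidence with the OCR confidence, for example by taking their product, so poor OCR input can never produce a high final score.

```python
# Toy sketch: penalize extraction confidence by OCR confidence
# (my own illustration of the idea, not the author's actual method).

def final_confidence(extraction_conf: float, ocr_conf: float) -> float:
    """Combine scores so weak OCR caps the overall confidence."""
    return extraction_conf * ocr_conf

# The extraction looked certain, but the OCR input was poor,
# so the combined score drops low enough to flag for human review:
print(round(final_confidence(0.98, 0.60), 3))
```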
I’d love to connect with other founders in this space. Competition is energizing, and the market is big enough for multiple winners. From what I see, you two, along with LlamaParse, are spearheading the movement. Incumbents are also moving fast (see the Snowflake + Landing AI partnership), but fragmentation is probably inevitable. Feels like the space will stratify fast: some will vanish, some will thrive quietly, and a few might become the core infrastructure layer.
We’re small, building hard, and proud to be part of this wave. Kudos again to @kbyatnal and @adit_a for raising the bar, would be great to chat anytime or even offer some workspace if you ever visit Spain!