~41K pages/sec peak throughput.
Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.
~5,000 lines, no dependencies, compiles in <2s.
Why it's fast:
- Memory-mapped file I/O (no read syscalls)
- Zero-copy parsing where possible
- SIMD-accelerated string search for finding PDF structures
- Parallel extraction across pages using Zig's thread pool
- Streaming output (no intermediate allocations for extracted text)
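To make the first bullet concrete, here is a minimal C sketch (not zpdf's Zig code; plain POSIX mmap, with glibc's memmem standing in for the SIMD search) that maps a PDF and locates the last startxref keyword without a single read() call or intermediate copy:

/* Sketch only, not zpdf's code: map a PDF read-only and find the last
   "startxref" marker with no read() syscalls and no copies.
   Assumes POSIX; memmem() needs _GNU_SOURCE on glibc. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.pdf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* The whole file becomes addressable memory; the kernel pages it in lazily. */
    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Scan for the trailer keyword; a vectorized search would replace memmem here. */
    const char needle[] = "startxref";
    const char *p = data, *hit = NULL;
    size_t left = (size_t)st.st_size;
    for (;;) {
        const char *q = memmem(p, left, needle, sizeof needle - 1);
        if (!q) break;              /* keep the last hit: the newest incremental update */
        hit = q;
        left -= (size_t)(q - p) + 1;
        p = q + 1;
    }
    if (hit)
        printf("last startxref at byte offset %ld\n", (long)(hit - data));

    munmap((void *)data, (size_t)st.st_size);
    close(fd);
    return 0;
}

From that offset a real extractor would parse the xref table or stream (and follow /Prev for incremental updates); the point here is only that the file never gets copied into a buffer first.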
What it handles:
- XRef tables and streams (PDF 1.5+)
- Incremental PDF updates (/Prev chain)
- FlateDecode, ASCII85, LZW, RunLength decompression
- Font encodings: WinAnsi, MacRoman, ToUnicode CMap
- CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
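And on the last bullet: Identity-H CID text and ToUnicode CMaps hand you UTF-16BE code units, where anything outside the BMP arrives as a surrogate pair. A minimal, self-contained C sketch of just that decoding step (again not zpdf's code; the function names are made up for illustration):

/* Sketch only: decode a UTF-16BE string -- the form ToUnicode CMaps and
   Identity-H CID text use -- into UTF-8, including surrogate pairs for
   code points above U+FFFF. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Append code point cp to out as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out) {
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/* Decode n bytes of UTF-16BE into out (2*n bytes is always enough);
   returns the UTF-8 length. Lone surrogates become U+FFFD. */
size_t utf16be_to_utf8(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t u = ((uint32_t)in[i] << 8) | in[i + 1];
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < n) {          /* high surrogate */
            uint32_t lo = ((uint32_t)in[i + 2] << 8) | in[i + 3];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {                 /* valid low surrogate */
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else u = 0xFFFD;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;                                         /* unpaired surrogate */
        }
        o += put_utf8(u, out + o);
    }
    return o;
}

int main(void) {
    /* "Hi " followed by U+1F600, encoded in UTF-16BE as the pair D83D DE00. */
    const unsigned char s[] = { 0x00,'H', 0x00,'i', 0x00,' ', 0xD8,0x3D, 0xDE,0x00 };
    unsigned char buf[2 * sizeof s];
    size_t len = utf16be_to_utf8(s, sizeof s, buf);
    fwrite(buf, 1, len, stdout);
    putchar('\n');
    return 0;
}

The sample string ends in the surrogate pair D83D DE00, which decodes to U+1F600 and comes out as four UTF-8 bytes.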
74910,74912c187768,187779
< [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
< corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954\251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
< std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
---
>
> [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
> corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
>
> § D.27.2
> 1954
>
> © ISO/IEC
> N4950
>
> wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
> std::string mbstring = myconv.to_bytes(L"Hello\n");
It is indeed faster, but the output is messier. And it doesn't handle Unicode, unlike mutool, which does. (That probably also explains the big speed boost.)

https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...
the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT
Python bindings would be good too.
- commit message: LLM-generated.
- README: LLM-generated.
I'm not convinced that projects vibe-coded in an evening deserve the HN front page…
Edit: and of course the author's blog is also full of AI slop…
2026 hasn't even started and I already hate it.
zpdf extract texbook.pdf | grep -m1 Stanford
DONALD E. KNUTHStanford UniversityIllustrations by
fpdf
jpdf
cpdf
cpppdf
bfpdf
ppdf
...
opdf
I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.
It would be interesting to see benchmarks on different PDF types: academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.