~41K pages/sec peak throughput.
Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.
~5,000 lines, no dependencies, compiles in <2s.
Why it's fast:
- Memory-mapped file I/O (no read syscalls)
- Zero-copy parsing where possible
- SIMD-accelerated string search for finding PDF structures
- Parallel extraction across pages using Zig's thread pool
- Streaming output (no intermediate allocations for extracted text)
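To make the first bullet concrete, here is a minimal C sketch (not zpdf's Zig code; plain POSIX mmap, with glibc's memmem standing in for the SIMD search) that maps a PDF and locates the last startxref keyword without a single read() call or intermediate copy:

/* Sketch only, not zpdf's code: map a PDF read-only and find the last
   "startxref" marker with no read() syscalls and no copies.
   Assumes POSIX; memmem() needs _GNU_SOURCE on glibc. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.pdf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* The whole file becomes addressable memory; the kernel pages it in lazily. */
    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Scan for the trailer keyword; a vectorized search would replace memmem here. */
    const char needle[] = "startxref";
    const char *p = data, *hit = NULL;
    size_t left = (size_t)st.st_size;
    for (;;) {
        const char *q = memmem(p, left, needle, sizeof needle - 1);
        if (!q) break;              /* keep the last hit: the newest incremental update */
        hit = q;
        left -= (size_t)(q - p) + 1;
        p = q + 1;
    }
    if (hit)
        printf("last startxref at byte offset %ld\n", (long)(hit - data));

    munmap((void *)data, (size_t)st.st_size);
    close(fd);
    return 0;
}

From that offset a real extractor would parse the xref table or stream (and follow /Prev for incremental updates); the point here is only that the file never gets copied into a buffer first.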
What it handles:
- XRef tables and streams (PDF 1.5+)
- Incremental PDF updates (/Prev chain)
- FlateDecode, ASCII85, LZW, RunLength decompression
- Font encodings: WinAnsi, MacRoman, ToUnicode CMap
- CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
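And on the last bullet: Identity-H CID text and ToUnicode CMaps hand you UTF-16BE code units, where anything outside the BMP arrives as a surrogate pair. A minimal, self-contained C sketch of just that decoding step (again not zpdf's code; the function names are made up for illustration):

/* Sketch only: decode a UTF-16BE string -- the form ToUnicode CMaps and
   Identity-H CID text use -- into UTF-8, including surrogate pairs for
   code points above U+FFFF. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Append code point cp to out as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out) {
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3F);
                        out[2] = 0x80 | (cp & 0x3F); return 3; }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/* Decode n bytes of UTF-16BE into out (2*n bytes is always enough);
   returns the UTF-8 length. Lone surrogates become U+FFFD. */
size_t utf16be_to_utf8(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t u = ((uint32_t)in[i] << 8) | in[i + 1];
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < n) {          /* high surrogate */
            uint32_t lo = ((uint32_t)in[i + 2] << 8) | in[i + 3];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {                 /* valid low surrogate */
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else u = 0xFFFD;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;                                         /* unpaired surrogate */
        }
        o += put_utf8(u, out + o);
    }
    return o;
}

int main(void) {
    /* "Hi " followed by U+1F600, encoded in UTF-16BE as the pair D83D DE00. */
    const unsigned char s[] = { 0x00,'H', 0x00,'i', 0x00,' ', 0xD8,0x3D, 0xDE,0x00 };
    unsigned char buf[2 * sizeof s];
    size_t len = utf16be_to_utf8(s, sizeof s, buf);
    fwrite(buf, 1, len, stdout);
    putchar('\n');
    return 0;
}

The sample string ends in the surrogate pair D83D DE00, which decodes to U+1F600 and comes out as four UTF-8 bytes.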
74910,74912c187768,187779
< [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
< corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954\251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
< std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
---
>
> [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
> corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
>
> § D.27.2
> 1954
>
> © ISO/IEC
> N4950
>
> wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
> std::string mbstring = myconv.to_bytes(L"Hello\n");
It is indeed faster, but the output is messier. And it doesn't handle Unicode, unlike mutool, which does. (That probably also explains the big speed boost.)

https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...
the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT
Python bindings would be good too.
- commit message: LLM-generated.
- README: LLM-generated.
I'm not convinced that projects vibe-coded in an evening deserve the HN front page…
Edit: and of course the author's blog is also full of AI slop…
2026 hasn't even started and I already hate it.
zpdf extract texbook.pdf | grep -m1 Stanford
DONALD E. KNUTHStanford UniversityIllustrations by
fpdf
jpdf
cpdf
cpppdf
bfpdf
ppdf
...
opdf
I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.
It would be interesting to see benchmarks on different PDF types: academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.