MinerU
Document extraction
An open-source tool that extracts structured content from PDFs, preserving layout, tables, and formulas.
MinerU converts PDFs and other documents into machine-readable formats such as Markdown and JSON, with particular care for scientific and technical content: it keeps tables intact and converts formulas to LaTeX.
It handles scanned documents through OCR and is aimed at producing high-quality training and retrieval data from dense source material.
Where it's ideally used
A fit when source PDFs are technical or scientific and formulas and tables must survive the conversion.
Where it doesn't fit
More machinery than needed for simple, text-only documents that any basic extractor handles.