Capability layer
Document extraction
Turning PDFs, scans, and messy files into clean, structured text a model can use.
Unstructured
★Document extraction
A toolkit and API for turning PDFs, office files, and emails into clean, chunked text for AI.
Docling
Document extraction
IBM's open-source library for parsing documents into a structured, AI-ready representation.
LlamaParse
Document extraction
A hosted parsing service tuned for extracting clean structure from complex, messy PDFs.
Marker
Document extraction
An open-source tool that converts PDFs, EPUBs, and other document formats into clean Markdown quickly and accurately.
MinerU
Document extraction
An open-source tool that extracts structured content from PDFs, preserving layout, tables, and formulas.
PaddleOCR
Document extraction
A widely used open-source OCR toolkit supporting many languages and document layouts.
Reducto
Document extraction
A commercial document-ingestion API focused on high-accuracy parsing of complex documents in regulated industries.
Surya
Document extraction
An open-source OCR and layout-analysis toolkit covering text, tables, and reading order in many languages.