PaddleOCR
Document extraction
A widely used open-source OCR toolkit supporting many languages and document layouts.
PaddleOCR is a mature OCR toolkit from Baidu's PaddlePaddle ecosystem. It covers text detection and recognition across many languages, plus document layout analysis and table and formula recognition.
It is a long-standing, well-supported project and a common base layer for turning scanned documents and images into text.
Where it's ideally used
A good fit when the core problem is OCR itself: extracting text from scans and images, possibly across multiple languages.
Where it doesn't fit
PaddleOCR is an OCR engine, not an end-to-end RAG ingestion pipeline: you build the surrounding steps, such as chunking, embedding, and indexing, yourself.
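To make "the surrounding steps" concrete, here is a minimal sketch of one of them: filtering recognized lines by confidence and packing them into chunks for later embedding. It does not call PaddleOCR. The OCR step is passed in as a callable, and the names `OcrLine`, `Chunk`, `chunk_lines`, `ingest`, `run_ocr`, `min_conf`, and `max_chars` are all hypothetical, as is the assumed (text, confidence) shape of each recognized line. Adapting real PaddleOCR output to that shape is left to a small wrapper of your own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Assumed shape for one recognized line: (text, confidence in [0, 1]).
# This is an illustration, not PaddleOCR's actual result format.
OcrLine = Tuple[str, float]


@dataclass
class Chunk:
    source: str  # path or ID of the originating document
    index: int   # position of the chunk within that document
    text: str


def chunk_lines(lines: Iterable[OcrLine], source: str,
                min_conf: float = 0.5, max_chars: int = 200) -> List[Chunk]:
    """Drop low-confidence lines, then pack the rest into chunks.

    Chunks are closed before they would exceed max_chars; a single line
    longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks: List[Chunk] = []
    buf = ""
    for text, conf in lines:
        text = text.strip()
        if not text or conf < min_conf:
            continue  # skip empty or unreliable lines
        candidate = f"{buf} {text}" if buf else text
        if buf and len(candidate) > max_chars:
            # Current chunk is full: emit it and start a new one.
            chunks.append(Chunk(source, len(chunks), buf))
            buf = text
        else:
            buf = candidate
    if buf:
        chunks.append(Chunk(source, len(chunks), buf))
    return chunks


def ingest(path: str, run_ocr: Callable[[str], List[OcrLine]],
           **kwargs) -> List[Chunk]:
    """run_ocr is a placeholder for whatever adapter wraps the OCR engine."""
    return chunk_lines(run_ocr(path), source=path, **kwargs)
```

Passing the OCR step as a function keeps the chunking logic testable with a stub and independent of any particular engine version. The resulting chunks would still need embedding and indexing before they are useful for retrieval.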