Comet Lab Atlas

A widely used open-source OCR toolkit supporting many languages and document layouts.

PaddleOCR is a mature OCR toolkit from Baidu's PaddlePaddle ecosystem. It covers text detection and recognition across many languages, plus document layout analysis and table and formula recognition.

It is a long-standing, well-supported project and a common base layer for turning scanned documents and images into text.

Where it's ideally used

A good fit when the core problem is OCR itself: extracting text from scans and images across multiple languages.

Where it doesn't fit

An OCR engine, not an end-to-end RAG ingestion pipeline: you build the surrounding steps, such as chunking, embedding, and indexing, yourself.
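
As a rough illustration of the kind of surrounding step you would write yourself, the sketch below filters recognized lines by confidence and packs them into text chunks for downstream indexing. It does not call PaddleOCR: `OcrLine`, `lines_to_chunks`, `min_confidence`, and `max_chars` are hypothetical names chosen for this example, and the assumed input shape (a text string plus a confidence score per line) would need to be mapped from whatever result structure your installed PaddleOCR version returns.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """One recognized line. Hypothetical shape, not PaddleOCR's own result type."""
    text: str
    confidence: float


def lines_to_chunks(lines, min_confidence=0.5, max_chars=200):
    """Drop low-confidence lines, then pack the rest into chunks.

    Chunks are kept at or under max_chars where possible; a single line
    longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for line in lines:
        text = line.text.strip()
        # Skip empty lines and lines the recognizer was unsure about.
        if not text or line.confidence < min_confidence:
            continue
        candidate = f"{current} {text}" if current else text
        if current and len(candidate) > max_chars:
            # Adding this line would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up sample data for illustration only.
sample = [
    OcrLine("Invoice 1042", 0.98),
    OcrLine("~~#", 0.12),
    OcrLine("Total due: 310 EUR", 0.95),
]
chunks = lines_to_chunks(sample, max_chars=20)
```

With the sample above, the low-confidence line is dropped and the two remaining lines land in separate chunks because together they exceed 20 characters. Embedding and indexing those chunks would be further steps outside the OCR engine.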