llama.cpp

An open-source C/C++ runtime for running LLMs efficiently on CPUs and consumer GPUs.

llama.cpp runs language models in portable C/C++ with minimal dependencies and aggressive weight quantization. It made running capable models on a laptop CPU or a modest GPU genuinely practical, and its GGUF model file format has become a de facto standard for distributing quantized models.
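To make the GGUF mention concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata counts); the names read_gguf_header and GGUFHeader are hypothetical helpers for illustration, not part of llama.cpp.

```python
import io
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed-size GGUF header from a binary stream.

    Assumption: little-endian layout of GGUF v2+:
    magic (4 bytes), version (uint32), tensor count (uint64),
    metadata key/value count (uint64).
    """
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")

    raw = stream.read(20)  # 4 + 8 + 8 bytes
    if len(raw) != 20:
        raise ValueError("truncated GGUF header")

    version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return GGUFHeader(version, tensor_count, kv_count)


if __name__ == "__main__":
    # Synthetic header bytes, so the sketch runs without a model file.
    fake = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
    print(read_gguf_header(fake))
```

With a real model you would pass an open file instead, for example open(path, "rb"), where path points at a .gguf file you already have. The metadata and tensor descriptors that follow the header are not parsed here.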

It is the engine quietly powering many higher-level tools, including much of what Ollama does under the hood.

Where it's ideally used

A good fit when models must run on constrained or CPU-only hardware, or when you want a tiny, dependency-light runtime.
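Whether a model fits on constrained hardware is mostly arithmetic on weight size. The sketch below estimates it from parameter count and bits per weight. The bits-per-weight figures are assumptions based on the common block layouts (Q4_0: 32 four-bit weights plus one 16-bit scale, about 4.5 bits per weight; Q8_0: 32 eight-bit weights plus one 16-bit scale, about 8.5), and the 7-billion-parameter model is a hypothetical example. It counts weights only, so the KV cache and runtime buffers come on top.

```python
# Approximate bits per weight for a few formats (assumed, see lead-in).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,   # 34 bytes per block of 32 weights
    "Q4_0": 4.5,   # 18 bytes per block of 32 weights
}


def weight_bytes(n_params: int, quant: str) -> int:
    """Estimate the bytes needed to hold the weights alone."""
    return int(n_params * BITS_PER_WEIGHT[quant] / 8)


def fits_in_memory(n_params: int, quant: str, ram_bytes: int) -> bool:
    """Rough check: do the weights alone fit in the given memory budget?"""
    return weight_bytes(n_params, quant) <= ram_bytes


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical 7B-parameter model
    for quant in BITS_PER_WEIGHT:
        gb = weight_bytes(n_params, quant) / 1e9
        print(f"{quant}: ~{gb:.2f} GB of weights")
```

Real GGUF files often mix tensor types, so treat the result as a ballpark figure rather than an exact file size.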

Where it doesn't fit

llama.cpp is a low-level tool. For high-concurrency GPU serving, a dedicated serving engine is more efficient, and for an easy local setup, a wrapper such as Ollama is friendlier.