llama.cpp
Model serving
An open-source C/C++ runtime for running LLMs efficiently on CPUs and consumer GPUs.
llama.cpp runs language models in portable C/C++ with minimal dependencies and aggressive quantization. It made running capable models on a laptop CPU or a modest GPU genuinely practical, and its GGUF model file format has become a de facto standard for quantized models in local inference.
It is the engine quietly powering many higher-level tools, including much of what Ollama does under the hood.
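To make "aggressive quantization" concrete, here is a minimal Python sketch of symmetric block-wise 8-bit quantization. It is loosely modeled on the idea behind GGUF's Q8_0 type: weights are split into blocks of 32, and each block stores one scale plus one signed 8-bit code per weight. The function names and block size are illustrative assumptions, not llama.cpp's API or its actual kernels. Real GGUF types also pack lower-bit codes (4-bit and below) more tightly.

```python
# Illustrative sketch of symmetric block-wise 8-bit quantization.
# Loosely modeled on the idea behind GGUF's Q8_0 type; NOT llama.cpp's code.
# All names here (quantize_block, quantize, ...) are hypothetical.

BLOCK_SIZE = 32  # assumption: 32 weights per block, one scale per block


def quantize_block(values):
    """Map a block of floats to one float scale plus int8-range codes."""
    amax = max(abs(v) for v in values)
    # Choose the scale so the largest magnitude maps to 127;
    # an all-zero block gets a dummy scale.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a block's scale and codes."""
    return [scale * q for q in codes]


def quantize(weights, block_size=BLOCK_SIZE):
    """Split a flat list of weights into blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Concatenate the dequantized contents of all blocks."""
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [0.01 * (i % 17 - 8) for i in range(64)]
    blocks = quantize(weights)  # 64 weights -> 2 blocks of 32
    restored = dequantize(blocks)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(len(blocks), max_err)  # per-weight error is at most scale / 2
```

If the scale is stored as a 16-bit float, each 32-weight block takes 32 + 2 = 34 bytes, or 8.5 bits per weight, versus 16 bits for fp16 weights. Lower-bit variants shrink this further at the cost of more rounding error.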
Where it's ideally used
A fit when models must run on constrained or CPU-only hardware, or when you want a tiny, dependency-light runtime.
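Whether a model fits on constrained hardware comes down mostly to parameter count times bits per weight. Below is a rough back-of-the-envelope sketch in Python. The helper names, the bits-per-weight figures in the usage note, and the flat `overhead_gib` allowance for the KV cache and runtime buffers are all illustrative assumptions, not values reported by llama.cpp.

```python
# Rough estimate of weight memory for a quantized model.
# Hypothetical helpers; real usage also depends on context length (KV cache),
# batch size, and runtime buffers, which are lumped into overhead_gib here.


def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_in_memory(n_params, bits_per_weight, available_gib, overhead_gib=1.0):
    """True if weights plus an assumed flat overhead fit in available_gib."""
    return weight_memory_gib(n_params, bits_per_weight) + overhead_gib <= available_gib


if __name__ == "__main__":
    # Assumed effective bits per weight for a few storage choices.
    for label, bpw in [("16-bit", 16.0), ("~8-bit", 8.5), ("~4-bit", 4.5)]:
        print(label, round(weight_memory_gib(7e9, bpw), 1))
```

For a 7-billion-parameter model, this gives roughly 13 GiB at 16 bits per weight but under 4 GiB at about 4.5 bits per weight. That difference decides whether the model fits in a typical laptop's RAM.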
Where it doesn't fit
A low-level tool. For high-concurrency GPU serving, a dedicated serving engine is more efficient, and for an easy local setup, a wrapper such as Ollama is friendlier.