vLLM

A high-throughput inference engine for serving open models efficiently in production.

vLLM is a serving engine built for throughput. Techniques like paged attention and continuous batching let it serve far more concurrent requests per GPU than a naive setup, with an OpenAI-compatible API.

It has become the default open-source choice for production self-hosted inference, and most new open models target it early.

Where it's ideally used

The right engine when you self-host a model under real concurrency and need to use GPU capacity efficiently.

Where it doesn't fit

More setup than warranted for a single-user local model — for that, a one-command runtime is the better tool.