vLLM
Model servingA high-throughput inference engine for serving open models efficiently in production.
vLLM is a serving engine built for throughput. Techniques like paged attention and continuous batching let it serve far more concurrent requests per GPU than a naive setup, with an OpenAI-compatible API.
It has become the default open-source choice for production self-hosted inference, and most new open models target it early.
Where it's ideally used
The right engine when you self-host a model under real concurrency and need to use GPU capacity efficiently.
Where it doesn't fit
More setup than warranted for a single-user local model — for that, a one-command runtime is the better tool.