Model serving
Runtimes and hosted platforms that turn model weights into a fast, callable endpoint.
Ollama
Model serving
A command-line tool that downloads and runs open-weight models locally with a single command.
Baseten
Model serving
A platform for deploying and serving custom and open models on autoscaling infrastructure.
Fireworks AI
Model serving
A hosted inference platform focused on fast, low-cost serving of open models.
Groq
Model serving
A hosted inference service running open models on custom hardware for very low latency.
llama.cpp
Model serving
An open-source C/C++ runtime for running LLMs efficiently on CPUs and consumer GPUs.
LM Studio
Model serving
A desktop app for discovering, downloading, and running local models with a graphical UI.
LocalAI
Model serving
An open-source, OpenAI-compatible API server that runs locally on top of many model backends.
Modal
Model serving
A serverless platform for running Python code, including model inference, on cloud GPUs.
Replicate
Model serving
A platform for running and fine-tuning open models behind a simple hosted API.
SGLang
Model serving
An open-source serving runtime built for high throughput, with support for structured generation.
Text Generation Inference
Model serving
Hugging Face's production-grade server for high-performance LLM inference.
Together AI
Model serving
A cloud platform for running and fine-tuning open models, with fast hosted inference.
vLLM
Model serving
A high-throughput inference engine for serving open models efficiently in production.