vLLM
Graduated · High-throughput LLM inference server with PagedAttention
The gold standard for self-hosted model serving. PagedAttention makes KV-cache management memory-efficient by allocating the cache in fixed-size blocks rather than one contiguous region per request. Supports continuous batching and exposes an OpenAI-compatible API. Requires a GPU. A minimal usage sketch follows.
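A minimal sketch of querying a locally running vLLM server through its OpenAI-compatible endpoint, assuming vLLM is installed, a GPU is available, and the server is listening on the default port 8000; the model name is an illustrative choice, swap in whatever you serve:

```python
# Start the server first (example model is an assumption):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct

from openai import OpenAI

# vLLM speaks the OpenAI API, so the standard client works against it.
# The api_key is unused by a default vLLM deployment but must be non-empty.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[
        {"role": "user", "content": "Explain PagedAttention in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

Because the API surface is OpenAI-compatible, existing clients and SDKs can be pointed at a vLLM deployment by changing only the base URL.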
