
ServerlessLLM

ServerlessLLM is a fast and easy-to-use serving system designed for affordable multi-LLM serving, also known as LLM-as-a-Service. ServerlessLLM is ideal for environments where multiple LLMs must be served on limited GPU resources, as it enables efficient dynamic loading of LLMs onto GPUs. By elastically scaling model instances and multiplexing GPUs, ServerlessLLM can significantly reduce costs compared to traditional GPU-dedicated serving systems while still delivering low Time-to-First-Token (TTFT) latency for LLM completions.
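As a sketch of what interacting with a multi-LLM serving system like this looks like, the snippet below builds an OpenAI-style chat-completion request body. The endpoint address and model name are illustrative assumptions, not values prescribed by ServerlessLLM; substitute the address of your own deployment and a model you have actually deployed.

```python
import json

# Hypothetical values for illustration only -- replace with your own
# deployment's address and a model name you have deployed.
ENDPOINT = "http://127.0.0.1:8343/v1/chat/completions"
MODEL = "facebook/opt-1.3b"


def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completion request body.

    The serving system routes the request to an instance of `model`,
    loading the model onto a GPU on demand if none is running.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_chat_request("What is a serverless LLM?")
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to the serving endpoint with any HTTP client; because many serving systems expose an OpenAI-compatible API, existing OpenAI client libraries can often be pointed at the endpoint unchanged.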

Documentation

Getting Started

ServerlessLLM Serve

ServerlessLLM Store

ServerlessLLM CLI