Single machine (from scratch)
This guide provides instructions for setting up ServerlessLLM from scratch on a single machine. This 'from scratch' approach means you will manually initialize and manage the Ray cluster components. It involves using multiple terminal sessions, each configured with a distinct Conda environment, to run the head and worker processes on the same physical machine, effectively simulating a multi-node deployment locally.
We strongly recommend using Docker (Compose) as detailed in the Docker Compose guide. Docker provides a smoother and generally easier setup process. Follow this guide only if Docker is not a suitable option for your environment.
Installation
Requirements
Ensure your system meets the following prerequisites:
- OS: Ubuntu 20.04
- Python: 3.10
- GPU: NVIDIA GPU with compute capability 7.0 or higher
Installing with pip
Follow these steps to install ServerlessLLM using pip:
Create the head environment:
# Create and activate a conda environment
conda create -n sllm python=3.10 -y
conda activate sllm
# Install ServerlessLLM and its store component
pip install serverless-llm serverless-llm-store
Create the worker environment:
# Create and activate a conda environment
conda create -n sllm-worker python=3.10 -y
conda activate sllm-worker
# Install ServerlessLLM (worker version) and its store component
pip install "serverless-llm[worker]" serverless-llm-store
If you plan to integrate vLLM with ServerlessLLM, you must apply a patch to your vLLM installation. For detailed instructions, refer to the vLLM Patch section below.
Installing from Source
To install ServerlessLLM from source, follow these steps:
Clone the repository:
git clone https://github.com/ServerlessLLM/ServerlessLLM.git
cd ServerlessLLM
Create the head environment:
# Create and activate a conda environment
conda create -n sllm python=3.10 -y
conda activate sllm
# Install sllm_store (pip install is recommended for speed)
cd sllm_store && rm -rf build
pip install .
cd ..
# Install ServerlessLLM
pip install .
Create the worker environment:
# Create and activate a conda environment
conda create -n sllm-worker python=3.10 -y
conda activate sllm-worker
# Install sllm_store (pip install is recommended for speed)
cd sllm_store && rm -rf build
pip install .
cd ..
# Install ServerlessLLM (worker version)
pip install ".[worker]"
vLLM Patch
To use vLLM with ServerlessLLM, you must apply a patch. The patch file is located at sllm_store/vllm_patch/sllm_load.patch within the ServerlessLLM repository and has been tested with vLLM version 0.6.6.
Apply the patch using the following script:
conda activate sllm-worker
./sllm_store/vllm_patch/patch.sh
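The patch targets a specific vLLM release, so it is worth confirming which version is installed in the worker environment before running the script. A quick check, assuming vLLM is already installed there:
conda activate sllm-worker
# Print the installed vLLM version; it should match the tested version (0.6.6)
python -c "import vllm; print(vllm.__version__)"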
Running ServerlessLLM Locally
These steps describe how to run ServerlessLLM on your local machine.
1. Start a Local Ray Cluster
First, initiate a local Ray cluster. This cluster will consist of one head node and one worker node (on the same machine).
Start the head node:
Open a new terminal and run:
conda activate sllm
ray start --head --port=6379 --num-cpus=4 --num-gpus=0 \
--resources='{"control_node": 1}' --block
Start the worker node:
Open another new terminal and run:
conda activate sllm-worker
export CUDA_VISIBLE_DEVICES=0 # Or your desired GPU ID
ray start --address=0.0.0.0:6379 --num-cpus=4 --num-gpus=1 \
--resources='{"worker_node": 1, "worker_id_0": 1}' --block
2. Start the ServerlessLLM Store Server
Next, start the ServerlessLLM Store server. By default, it uses ./models as the storage path.
Open a new terminal and run:
conda activate sllm-worker
export CUDA_VISIBLE_DEVICES=0 # Or your desired GPU ID
sllm-store start
Expected output:
$ sllm-store start
INFO 12-31 17:13:23 cli.py:58] Starting gRPC server
INFO 12-31 17:13:23 server.py:34] StorageServicer: storage_path=./models, mem_pool_size=4294967296, num_thread=4, chunk_size=33554432, registration_required=False
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20241231 17:13:23.947276 2165054 checkpoint_store.cpp:41] Number of GPUs: 1
I20241231 17:13:23.947299 2165054 checkpoint_store.cpp:43] I/O threads: 4, chunk size: 32MB
I20241231 17:13:23.947309 2165054 checkpoint_store.cpp:45] Storage path: "./models"
I20241231 17:13:24.038651 2165054 checkpoint_store.cpp:71] GPU 0 UUID: c9938b31-33b0-e02f-24c5-88bd6fbe19ad
I20241231 17:13:24.038700 2165054 pinned_memory_pool.cpp:29] Creating PinnedMemoryPool with 128 buffers of 33554432 bytes
I20241231 17:13:25.557906 2165054 checkpoint_store.cpp:83] Memory pool created with 4GB
INFO 12-31 17:13:25 server.py:243] Starting gRPC server on 0.0.0.0:8073
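If you want to double-check that the store server is reachable before starting the other components, you can probe the gRPC port reported in the log above (8073 by default). A minimal check, assuming the ss utility from iproute2 is available:
# Confirm that the store server is listening on its gRPC port
ss -ltn | grep 8073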
3. Start ServerlessLLM Serve
Now, start the ServerlessLLM Serve process (sllm-serve).
Open a new terminal and run:
conda activate sllm
sllm-serve start
At this point, you should have four terminals open: one for the Ray head node, one for the Ray worker node, one for the ServerlessLLM Store server, and one for ServerlessLLM Serve.
4. Deploy a Model
With all services running, you can deploy a model.
Open a new terminal and run:
conda activate sllm
sllm-cli deploy --model facebook/opt-1.3b
This command downloads the specified model from Hugging Face Hub. To load a model from a local path instead, you can use a config.json file; refer to the CLI API documentation for details.
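The authoritative list of deployment options lives in the CLI API documentation. If the CLI follows the usual --help convention (an assumption, not verified here), you can also print the supported flags directly:
# Assumes sllm-cli supports a standard --help flag; see the CLI API documentation otherwise
sllm-cli deploy --help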
5. Query the Model
Once the model is deployed, you can query it using any OpenAI API-compatible client. For example, use the following curl command:
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'
Expected output:
{"id":"chatcmpl-9f812a40-6b96-4ef9-8584-0b8149892cb9","object":"chat.completion","created":1720021153,"model":"facebook/opt-1.3b","choices":[{"index":0,"message":{"role":"assistant","content":"system: You are a helpful assistant.\nuser: What is your name?\nsystem: I am a helpful assistant.\n"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"completion_tokens":26,"total_tokens":42}}
Clean Up
To delete a deployed model, use the following command:
sllm-cli delete facebook/opt-1.3b
This command removes the specified model from the ServerlessLLM server.
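To shut down the rest of the local deployment, stop the blocking processes (sllm-serve, sllm-store, and the two ray start --block commands) with Ctrl+C in their respective terminals. Any remaining Ray processes on the machine can then be stopped with:
conda activate sllm
# Stop all Ray processes running on this machine
ray stop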