
SLURM-based cluster setup guide

This guide will help you get started with running ServerlessLLM on a SLURM cluster. It covers two deployment methods, one based on srun and one based on sbatch. For development we recommend srun, as it is easier to debug; for production, sbatch is recommended. Please make sure you have installed ServerlessLLM on all machines by following the installation guide.

Pre-requisites

Before you begin, work through the following checks and tips.

Some Tips about Installation

  • If pip install fails with a 'not enough disk space' error on the login node, you can submit the installation to a job node instead. For example:
    #!/bin/bash
    #SBATCH --partition=Teach-Standard
    #SBATCH --job-name=sllm-pip
    #SBATCH --output=sllm_pip.out
    #SBATCH --error=sllm_pip.err
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --gpus-per-task=0

    # Identify which conda installation you are using; this example assumes conda is in /opt/conda
    source /opt/conda/bin/activate

    conda create -n sllm python=3.10 -y
    conda activate sllm
    pip install serverless-llm
    pip install serverless-llm-store

    conda deactivate

    conda create -n sllm-worker python=3.10 -y
    conda activate sllm-worker
    pip install "serverless-llm[worker]"
    pip install serverless-llm-store
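    Save the script (the file name, e.g. sllm_pip.sh, is arbitrary) and submit it from the login node in the same way as the other batch scripts in this guide:

    sbatch sllm_pip.sh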

Command for Querying GPU Resource Information

Run the following command on the cluster to check GPU resource information.

sinfo -O partition,nodelist,gres

Expected Output

PARTITION           NODELIST            GRES
Partition1          JobNode[01,03]      gpu:gtx_1060:8
Partition2          JobNode[04-17]      gpu:a6000:2,gpu:gtx_

Identify an idle node

Use sinfo -p <partition> to identify idle nodes.

Expected Output

$ sinfo -p compute
PARTITION  AVAIL  NODES  STATE  TIMELIMIT  NODELIST
compute    up     10     idle   infinite   JobNode[01-10]
compute    up     5      alloc  infinite   JobNode[11-15]
compute    up     2      down   infinite   JobNode[16-17]

Job Nodes Setup

srun Node Selection

A single job node is enough: the head, the worker, the store, and the serve all run on it inside one interactive session.

sbatch Node Selection

We will start the head on the main job node (JobNode01) and the worker on another job node (JobNode02). The head and the worker should be on different job nodes to avoid resource contention. sllm-store should run on the job node that hosts the worker (JobNode02) so that it can pass the model weights, and sllm-serve should run on the main job node (JobNode01). Finally, you can use sllm-cli on the login node to manage the models.

Note: JobNode02 requires a GPU, but JobNode01 does not.

  • Head: JobNode01
  • Worker: JobNode02
  • sllm-store: JobNode02
  • sllm-serve: JobNode01
  • sllm-cli: Login Node

SRUN

For development we recommend using srun to start ServerlessLLM, as it is easier to debug than sbatch.

Step 1: Use srun to enter the JobNode

To start an interactive session on the specified compute node (JobNode), use:

srun --partition <your-partition> --nodelist <JobNode> --gres <DEVICE>:1 --pty bash

This command requests a session on the specified node and provides an interactive shell. --gres <DEVICE>:1 specifies the GPU device you will use, for example: --gres gpu:gtx_1060:1

Step 2: Install from source

First, make sure the CUDA driver is available on the node. Here are some commands to check it.

nvidia-smi

which nvcc

If nvidia-smi lists GPU information but which nvcc produces no output, use the following commands to load nvcc. This example assumes CUDA is located at /opt/cuda-12.2.0:

export PATH=/opt/cuda-12.2.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda-12.2.0/lib64:$LD_LIBRARY_PATH
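After loading, which nvcc should print the compiler path, and you can confirm the toolkit version:

nvcc --version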

Then follow the installation guide to install from source.

Step 3: Prepare multiple windows with tmux

Since srun provides a single interactive shell, you can use tmux to create multiple windows. Start a tmux session:

tmux

This creates a new tmux session.

Create multiple windows

  • Use Ctrl+B, then C to start a new window
  • Repeat the shortcut 4 more times to create a total of 5 windows.

What if Ctrl+B does not work?

If Ctrl + B is unresponsive, reset tmux key bindings:

tmux unbind C-b
tmux set-option -g prefix C-b
tmux bind C-b send-prefix

Command to switch windows

Once multiple windows are created, you can switch between them using:

  • Ctrl+B, then N (next window)
  • Ctrl+B, then P (previous window)
  • Ctrl+B, then W (list all windows and select one)
  • Ctrl+B, then [number] (switch to a specific window, e.g., Ctrl+B, then 1)

Step 4: Run ServerlessLLM on the JobNode

First, find out which ports are already occupied. Then pick a free port and use it in place of the placeholder <PORT> below, for example 6379.
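One minimal way to check whether a candidate port (here 6379) is already taken on the node, assuming the ss utility is available; no output means the port is free:

ss -tln | grep :6379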

Note that some SLURM systems respond slowly, so please be patient while waiting for output.

In the first window, start a local ray cluster with 1 head node and 1 worker node:

source /opt/conda/bin/activate
conda activate sllm
ray start --head --port=<PORT> --num-cpus=4 --num-gpus=0 --resources='{"control_node": 1}' --block

In the second window, start the worker node:

source /opt/conda/bin/activate
conda activate sllm-worker
export CUDA_VISIBLE_DEVICES=0
ray start --address=0.0.0.0:<PORT> --num-cpus=4 --num-gpus=1 --resources='{"worker_node": 1, "worker_id_0": 1}' --block
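Optionally, confirm that both the head and the worker have registered before moving on. In either window where Ray is active, run:

ray status

The resource summary should include the custom control_node and worker_node resources.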

In the third window, start ServerlessLLM Store server:

source /opt/conda/bin/activate
conda activate sllm-worker
export CUDA_VISIBLE_DEVICES=0
sllm-store start

In the 4th window, start ServerlessLLM Serve:

source /opt/conda/bin/activate
conda activate sllm
sllm-serve start
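Before deploying, you can check from a spare window that the API server is listening on the default port 8343 (again assuming ss is available; no output means it is not up yet):

ss -tln | grep :8343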

Everything is set!

In the 5th window, let's deploy a model to the ServerlessLLM server. You can deploy a model by running the following command:

source /opt/conda/bin/activate
conda activate sllm
sllm-cli deploy --model facebook/opt-1.3b --backend transformers

This will download the model from the Hugging Face Hub. After deploying, you can query the model with any OpenAI-compatible API client. For example, you can use the following curl command:

curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'

Expected output:

{"id":"chatcmpl-9f812a40-6b96-4ef9-8584-0b8149892cb9","object":"chat.completion","created":1720021153,"model":"facebook/opt-1.3b","choices":[{"index":0,"message":{"role":"assistant","content":"system: You are a helpful assistant.\nuser: What is your name?\nsystem: I am a helpful assistant.\n"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"completion_tokens":26,"total_tokens":42}}

Step 5: Clean up

To delete a deployed model, use the following command:

sllm-cli delete facebook/opt-1.3b

This will remove the specified model from the ServerlessLLM server.

In each window, press Ctrl + C to stop the running process, then type exit to close the window (closing the last window ends the tmux session).


SBATCH

Step 1: Start the Head Node

Since the head node does not require a GPU, you can deploy it on a node with low compute capacity.

  1. Activate the sllm environment and start the head node:

    Here is the example script, named start_head_node.sh.

    #!/bin/bash
    #SBATCH --partition=your-partition # Specify the partition
    #SBATCH --nodelist=JobNode01 # Specify an idle node
    #SBATCH --job-name=ray-head
    #SBATCH --output=sllm_head.out
    #SBATCH --error=sllm_head.err
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=12
    #SBATCH --gpus-per-task=0

    cd /path/to/ServerlessLLM

    source /opt/conda/bin/activate # make sure conda will be loaded correctly
    conda activate sllm

    ray start --head --port=6379 --num-cpus=12 --num-gpus=0 --resources='{"control_node": 1}' --block
    • Replace your-partition, JobNode01, and /path/to/ServerlessLLM with your own values.
  2. Submit the script

    Use sbatch start_head_node.sh to submit the script to the chosen idle node.

  3. Expected output

    In sllm_head.out, you will see the following output:

    Local node IP: <HEAD_NODE_IP>
    --------------------
    Ray runtime started.
    --------------------

    Remember the IP address, denoted <HEAD_NODE_IP>; you will need it in the following steps. (An alternative way to look up a node's IP from the login node is sketched after this list.)

  4. Find an available port for the serve

  • Some HPCs have a firewall that blocks port 8343. You can use nc -zv <HEAD_NODE_IP> 8343 to check if the port is accessible.
  • If it is not accessible, use the following script to find an available port on the node.
  • Here is an example script, named find_port.sh
#!/bin/bash
#SBATCH --partition=your-partition
#SBATCH --nodelist=JobNode01
#SBATCH --job-name=find_port
#SBATCH --output=find_port.log
#SBATCH --time=00:05:00
#SBATCH --mem=1G

echo "Finding available port on $(hostname)"

python -c "
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('', 0))
    print(f'Available port: {s.getsockname()[1]}')
"

Use sbatch find_port.sh to submit the script to JobNode01, and in find_port.log, you will see the following output:

Finding available port on JobNode01
Available port: <avail_port>

Remember this <avail_port>; you will need it in the later steps.
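If you lose track of the head node's IP address later, you can also look it up from the login node. A small sketch (the exact fields shown depend on your site's SLURM configuration):

scontrol show node JobNode01 | grep -i addr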

Step 2: Start the Worker Node & Store

We start the worker node and the store in the same script because the store server loads the model weights onto the GPU and passes a pointer to them to the client through shared GPU memory. If you submitted the store in a separate script with #SBATCH --gres=gpu:1, it could be assigned a different GPU (through a different CUDA_VISIBLE_DEVICES setting), and the model weights could not be passed.

  1. Activate the sllm-worker environment and start the worker node.

    Here is the example script, named start_worker_node.sh.

    #!/bin/bash
    #SBATCH --partition=your_partition
    #SBATCH --nodelist=JobNode02
    #SBATCH --gres=gpu:a6000:1 # Specify the GPU device on JobNode02; request 1 GPU
    #SBATCH --job-name=sllm-worker-store
    #SBATCH --output=sllm_worker.out
    #SBATCH --error=sllm_worker.err
    #SBATCH --cpus-per-task=4 # Request 4 CPU cores
    #SBATCH --mem=16G # Request 16GB of RAM

    cd /path/to/ServerlessLLM

    source /opt/conda/bin/activate # make sure conda is loaded correctly
    conda activate sllm-worker

    HEAD_NODE_IP=<HEAD_NODE_IP>

    export CUDA_HOME=/opt/cuda-12.5.0 # replace with your CUDA path
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

    ray start --address=$HEAD_NODE_IP:6379 --num-cpus=4 --num-gpus=1 \
    --resources='{"worker_node": 1, "worker_id_0": 1}' --block &

    sllm-store start &

    wait
    • Read the HPC's documentation to find out which partition you can use. Replace your_partition in the script with that partition name.
    • Replace /path/to/ServerlessLLM with the path to the ServerlessLLM installation directory.
    • Replace <HEAD_NODE_IP> with the IP address of the head node.
    • Replace /opt/cuda-12.5.0 with your CUDA installation path.
  2. Find the CUDA path

    • Some slurm-based HPCs have a module system, you can use module avail cuda to find the CUDA module.
    • If that does not work, read the HPC's documentation carefully to find the CUDA path. For example, the documentation may say CUDA is installed under /opt. You can then use srun to start an interactive session on the node, such as srun --pty -t 00:30:00 -p your_partition --gres=gpu:1 /bin/bash, and look for the path in the pseudo-terminal it opens.
    • Once you have found it, replace /opt/cuda-12.5.0 with your CUDA path.
  3. Submit the script on the other node

    Use sbatch start_worker_node.sh to submit the script to an idle node (here we assume it is JobNode02). We recommend placing the head and the worker on different nodes so that the serve can start promptly later instead of queuing for resource allocation. (A quick way to verify that the worker has joined the Ray cluster is sketched after this step's expected output.)

  4. Expected output

    In sllm_worker.out, you will see the following output:

    • The worker node expected output:
       Local node IP: xxx.xxx.xx.xx
      --------------------
      Ray runtime started.
      --------------------
    • The store expected output:
      I20241030 11:52:54.719007 1321560 checkpoint_store.cpp:41] Number of GPUs: 1
      I20241030 11:52:54.773468 1321560 checkpoint_store.cpp:43] I/O threads: 4, chunk size: 32MB
      I20241030 11:52:54.773548 1321560 checkpoint_store.cpp:45] Storage path: "./models/"
      I20241030 11:52:55.060559 1321560 checkpoint_store.cpp:71] GPU 0 UUID: 52b01995-4fa9-c8c3-a2f2-a1fda7e46cb2
      I20241030 11:52:55.060798 1321560 pinned_memory_pool.cpp:29] Creating PinnedMemoryPool with 128 buffers of 33554432 bytes
      I20241030 11:52:57.258795 1321560 checkpoint_store.cpp:83] Memory pool created with 4GB
      I20241030 11:52:57.262835 1321560 server.cpp:306] Server listening on 0.0.0.0:8073
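To double-check that the worker has joined the head's Ray cluster, you can run ray status against the head's address. A minimal sketch (adjust the partition and conda setup to your site; the interactive srun session is just one way to get a shell where Ray is available):

srun --partition=your_partition --nodelist=JobNode01 --pty /bin/bash
# then, inside the interactive shell:
source /opt/conda/bin/activate
conda activate sllm
ray status --address=<HEAD_NODE_IP>:6379

The resource summary should list both the control_node and worker_node custom resources.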

Step 3: Start the Serve on the Head Node

  1. Activate the sllm environment and start the serve.

    Here is the example script, named start_serve.sh.

    #!/bin/bash
    #SBATCH --partition=your_partition
    #SBATCH --nodelist=JobNode01 # This node should be the same as head
    #SBATCH --output=serve.log

    cd /path/to/ServerlessLLM

    source /opt/conda/bin/activate # make sure conda is loaded correctly
    conda activate sllm

    sllm-serve start --host <HEAD_NODE_IP>
    # sllm-serve start --host <HEAD_NODE_IP> --port <avail_port> # if you have changed the port
    • Replace your_partition in the script as before.
    • Replace /path/to/ServerlessLLM as before.
    • Replace <avail_port> with the port you found in Step 1 (only needed if port 8343 is not accessible).
  2. Submit the script on the head node

    Use sbatch start_serve.sh to submit the script to the head node (JobNode01).

  3. Expected output

    -- Connecting to existing Ray cluster at address: xxx.xxx.xx.xx:6379...
    -- Connected to Ray cluster.
    INFO: Started server process [1339357]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://xxx.xxx.xx.xx:8343 (Press CTRL+C to quit)
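Before moving on, you can confirm from the login node that the API port is reachable, reusing the check from Step 1:

nc -zv <HEAD_NODE_IP> 8343

Use <avail_port> instead of 8343 if you changed the port.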

Step 4: Use sllm-cli to manage models

  1. You can do this step on the login node. Set the LLM_SERVER_URL environment variable:
    $ conda activate sllm
    (sllm)$ export LLM_SERVER_URL=http://<HEAD_NODE_IP>:8343/
    • Replace <HEAD_NODE_IP> with the actual IP address of the head node.
    • Replace 8343 with the actual port number (<avail_port> in Step 1) if you have changed it.
  2. Deploy a Model Using sllm-cli
    (sllm)$ sllm-cli deploy --model facebook/opt-1.3b

Step 5: Query the Model Using OpenAI API Client

You can use the following command to query the model:

curl http://<HEAD_NODE_IP>:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'
  • Replace <HEAD_NODE_IP> with the actual IP address of the head node.
  • Replace 8343 with the actual port number (<avail_port> in Step 1) if you have changed it.

Step 6: Stop Jobs

On a SLURM cluster, jobs are usually stopped with the scancel command. First, list all jobs you have submitted (replace your_username with your username):

$ squeue -u your_username
JOBID  PARTITION  NAME               USER           ST  TIME  NODES  NODELIST(REASON)
1234   compute    sllm-head          your_username  R   0:01  1      JobNode01
1235   compute    sllm-worker-store  your_username  R   0:01  1      JobNode02
1236   compute    sllm-serve         your_username  R   0:01  1      JobNode01

Then, use scancel to stop the job (1234, 1235 and 1236 are JOBIDs):

$ scancel 1234 1235 1236
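After cancelling, run squeue -u your_username again; an empty job list confirms that the head, worker/store, and serve jobs have all stopped.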