PEFT LoRA Serving

This example illustrates the process of deploying and serving a base large language model enhanced with LoRA (Low-Rank Adaptation) adapters in a ServerlessLLM cluster. It demonstrates how to start the cluster, deploy a base model with multiple LoRA adapters, perform inference using different adapters, and update or remove the adapters dynamically.

Pre-requisites

To run this example, we will use Docker Compose to set up a ServerlessLLM cluster. Before proceeding, please ensure you have read the Quickstart Guide.

We will use the following example base model and LoRA adapters:

  • Base model: facebook/opt-125m
  • LoRA adapters:
    • peft-internal-testing/opt-125m-dummy-lora
    • monsterapi/opt125M_alpaca
    • edbeeching/opt-125m-lora
    • Hagatiana/opt-125m-lora

Usage

Start a local Docker-based Ray cluster using Docker Compose.

Step 1: Download the Docker Compose File

Download the docker-compose.yml file from the ServerlessLLM repository:

# Create a directory for the ServerlessLLM Docker setup
mkdir serverless-llm-docker && cd serverless-llm-docker

# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml

# Alternatively, you can use wget:
# wget https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml

Step 2: Configuration

Set the model directory. Create a directory on your host machine where models will be stored, and set the MODEL_FOLDER environment variable to point to this directory:

export MODEL_FOLDER=/path/to/your/models

Replace /path/to/your/models with the actual path where you want to store the models.
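
If the directory does not already exist, create it before starting the services, for example:

mkdir -p $MODEL_FOLDER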

Step 3: Start the Services

Start the ServerlessLLM services using Docker Compose:

docker compose up -d

This command will start the Ray head node and two worker nodes defined in the docker-compose.yml file.
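
To confirm that the containers started, you can list them (the exact container names depend on the docker-compose.yml you downloaded):

docker compose ps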

tip

Use the following command to monitor the logs of the head node:

docker logs -f sllm_head

Step 4: Deploy Models with LoRA Adapters

  1. Configure the client environment:

conda activate sllm
export LLM_SERVER_URL=http://127.0.0.1:8343

  2. Deploy the base model with the specified LoRA adapters:

sllm-cli deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters demo_lora1=peft-internal-testing/opt-125m-dummy-lora demo_lora2=monsterapi/opt125M_alpaca

  3. Verify the deployment:

curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ],
    "lora_adapter_name": "demo_lora1"
  }'
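
To route a request to the second adapter deployed above, set lora_adapter_name to demo_lora2 in the same request:

curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ],
    "lora_adapter_name": "demo_lora2"
  }'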

If no LoRA adapter is specified, the system will use the base model for inference:

curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'
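
Assuming the response follows the standard OpenAI-style chat completion schema, you can optionally pipe it through jq to extract just the generated text:

curl -s $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "What is your name?"}]
  }' | jq -r '.choices[0].message.content'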

Step 5: Update LoRA Adapters

If you wish to switch to a different set of LoRA adapters, run the sllm-cli deploy command again with the updated adapter configurations. ServerlessLLM will automatically load the new adapters without restarting the backend.

sllm-cli deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters demo-lora1=edbeeching/opt-125m-lora demo-lora2=Hagatiana/opt-125m-lora
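
As in Step 4, you can verify that a new adapter is being served by naming it in a chat completion request:

curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ],
    "lora_adapter_name": "demo-lora1"
  }'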

Step 6: Clean Up

Delete the LoRA adapters by running the following command (this removes only the LoRA adapters; the base model is not deleted):

sllm-cli delete facebook/opt-125m --lora-adapters demo-lora1 demo-lora2
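
If you also want to remove the deployed base model, you can delete it without the --lora-adapters flag:

sllm-cli delete facebook/opt-125m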

If you need to stop and remove the containers, you can use the following commands:

docker compose down
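
Optionally, if you no longer need the downloaded model files, remove the model directory you set in Step 2 (double-check the path before running this):

rm -rf $MODEL_FOLDER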