ServerlessLLM CLI Documentation

Overview

sllm-cli is a command-line interface (CLI) tool designed for managing and interacting with ServerlessLLM models. This document provides an overview of the available commands and their usage.

Getting Started

Before using the sllm-cli commands, you need to start the ServerlessLLM cluster. Follow the ServerlessLLM cluster setup guides to do so.

After setting up the ServerlessLLM cluster, you can use the commands listed below to manage and interact with your models.

Example Workflow

  1. Deploy a Model

    Deploy a model by name. The name must be a Hugging Face pretrained model name, for example "facebook/opt-1.3b" rather than "opt-1.3b".

    sllm-cli deploy --model facebook/opt-1.3b
  2. Generate Output

    echo '{
      "model": "facebook/opt-1.3b",
      "messages": [
        {
          "role": "user",
          "content": "Please introduce yourself."
        }
      ],
      "temperature": 0.7,
      "max_tokens": 50
    }' > input.json
    sllm-cli generate input.json
  3. Delete a Model

    sllm-cli delete facebook/opt-1.3b
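
The three steps above can also be scripted. The following is a minimal sketch in Python, assuming sllm-cli is on the PATH and a cluster is already running. The helper names (write_request, workflow_commands, run_workflow) are hypothetical and are not part of ServerlessLLM. By default the script only prints the commands. Pass dry_run=False to execute them.

```python
import json
import subprocess
from pathlib import Path

MODEL = "facebook/opt-1.3b"  # full Hugging Face name, as step 1 requires


def write_request(path: Path, prompt: str) -> dict:
    """Write a chat-style request file like the one in step 2."""
    request = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 50,
    }
    path.write_text(json.dumps(request, indent=2))
    return request


def workflow_commands(request_path: Path) -> list[list[str]]:
    """Return the deploy, generate, and delete commands as argument lists."""
    return [
        ["sllm-cli", "deploy", "--model", MODEL],
        ["sllm-cli", "generate", str(request_path)],
        ["sllm-cli", "delete", MODEL],
    ]


def run_workflow(request_path: Path, dry_run: bool = True) -> None:
    """Print each command, and execute it only when dry_run is False."""
    for cmd in workflow_commands(request_path):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

For example, calling write_request(Path("input.json"), "Please introduce yourself.") followed by run_workflow(Path("input.json")) writes the request file and prints the three commands without running them.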

sllm-cli deploy

Deploy a model using a configuration file or a model name, with options to override individual default settings. The configuration file needs only a minimal specification, because sensible defaults are provided for the advanced options.

For more details on the advanced configuration options and their default values, please refer to the Example Configuration File section.

Usage
sllm-cli deploy [OPTIONS]
Options
  • --model <model_name>

    • Model name to deploy with the default configuration. The model name must be a Hugging Face pretrained model name. The list of available models is on the Hugging Face Hub.
  • --config <config_path>

    • Path to the JSON configuration file. The configuration file can be incomplete, and missing sections will be filled in by the default configuration.
  • --backend <backend_name>

    • Override the backend in the default configuration.
  • --num_gpus <number>

    • Override the number of GPUs in the default configuration.
  • --target <number>

    • Override the target concurrency in the default configuration.
  • --min_instances <number>

    • Override the minimum number of instances in the default configuration.
  • --max_instances <number>

    • Override the maximum number of instances in the default configuration.
Examples

Deploy using a model name with default configuration:

sllm-cli deploy --model facebook/opt-1.3b

Deploy using a configuration file:

sllm-cli deploy --config /path/to/config.json

Deploy using a model name and override the backend:

sllm-cli deploy --model facebook/opt-1.3b --backend transformers

Deploy using a model name and override multiple settings:

sllm-cli deploy --model facebook/opt-1.3b --num_gpus 2 --target 5 --min_instances 1 --max_instances 5
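
When deployments are driven from a script, the override flags can be assembled programmatically. The sketch below builds the argument list for sllm-cli deploy from keyword arguments and uses only the flags listed under Options. The function deploy_command is hypothetical. Requiring at least one of --model or --config, and rejecting unknown option names, are assumptions of this sketch rather than documented CLI behavior.

```python
from typing import Optional

# Override flags documented for `sllm-cli deploy`, in the order used above.
OVERRIDE_FLAGS = ("backend", "num_gpus", "target", "min_instances", "max_instances")


def deploy_command(model: Optional[str] = None,
                   config: Optional[str] = None,
                   **overrides) -> list[str]:
    """Build the argument list for `sllm-cli deploy` without running it."""
    if model is None and config is None:
        # Assumption: the CLI needs a model name or a config file to act on.
        raise ValueError("provide a model name or a config path")
    cmd = ["sllm-cli", "deploy"]
    if model is not None:
        cmd += ["--model", model]
    if config is not None:
        cmd += ["--config", config]
    for name in OVERRIDE_FLAGS:
        if name in overrides:
            cmd += [f"--{name}", str(overrides.pop(name))]
    if overrides:
        # Anything left over is not a documented flag.
        raise ValueError(f"unknown options: {sorted(overrides)}")
    return cmd


print(" ".join(deploy_command(model="facebook/opt-1.3b", num_gpus=2, target=5)))
```

The returned list can be passed to subprocess.run to execute the deployment.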
Example Configuration File (config.json)

This file can be incomplete, and missing sections will be filled in by the default configuration:

{
  "model": "facebook/opt-1.3b",
  "backend": "transformers",
  "num_gpus": 1,
  "auto_scaling_config": {
    "metric": "concurrency",
    "target": 1,
    "min_instances": 0,
    "max_instances": 10,
    "keep_alive": 0
  },
  "backend_config": {
    "pretrained_model_name_or_path": "facebook/opt-1.3b",
    "device_map": "auto",
    "torch_dtype": "float16",
    "hf_model_class": "AutoModelForCausalLM"
  }
}
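
To illustrate what "filled in by the default configuration" can mean for an incomplete file, here is a minimal sketch of a recursive merge. It assumes that the values in the example file above are the defaults and that nested sections are merged key by key. How sllm-cli actually merges configurations may differ, and fill_defaults is a hypothetical helper.

```python
import copy
import json

# Assumption: defaults mirror the example config.json shown above.
DEFAULT_CONFIG = {
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10,
        "keep_alive": 0,
    },
    "backend_config": {
        "pretrained_model_name_or_path": "facebook/opt-1.3b",
        "device_map": "auto",
        "torch_dtype": "float16",
        "hf_model_class": "AutoModelForCausalLM",
    },
}


def fill_defaults(partial: dict, defaults: dict = DEFAULT_CONFIG) -> dict:
    """Return a new config: values from `partial` win, gaps come from `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them whole.
            merged[key] = fill_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


partial = {"model": "facebook/opt-1.3b", "auto_scaling_config": {"max_instances": 2}}
full = fill_defaults(partial)
print(json.dumps(full, indent=2))
```

In this sketch only max_instances changes, while every other field keeps the value from DEFAULT_CONFIG.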

Below is a description of all the fields in config.json.

| Field | Description |
| --- | --- |
| model | Hugging Face model name, used to identify the model instance. |
| backend | Inference engine. Currently supports transformers and vllm. |
| num_gpus | Number of GPUs used to deploy a model instance. |
| auto_scaling_config | Auto-scaling configuration. |
| auto_scaling_config.metric | Metric used to decide whether to scale up or down. |
| auto_scaling_config.target | Target value of the metric. |
| auto_scaling_config.min_instances | Minimum number of model instances. |
| auto_scaling_config.max_instances | Maximum number of model instances. |
| auto_scaling_config.keep_alive | How long, in seconds, a model instance stays alive after inference ends. For example, if keep_alive is set to 30, the instance waits 30 seconds after inference ends to see whether another request arrives. |
| backend_config | Inference backend configuration. |
| backend_config.pretrained_model_name_or_path | Location to load the model from. This can be a Hugging Face model name or a local path. |
| backend_config.device_map | Device map used to load the model. auto is suitable for most scenarios. |
| backend_config.torch_dtype | Torch dtype of the model. |
| backend_config.hf_model_class | Hugging Face model class. |
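
The keep_alive behavior described above can be pictured as a simple idle-time check. The function below is an illustrative sketch only, not ServerlessLLM's scheduler code. The name should_release and the choice to release exactly at the keep_alive boundary are assumptions.

```python
def should_release(last_inference_end: float, now: float, keep_alive: float) -> bool:
    """Return True once an instance has been idle for at least keep_alive seconds.

    All arguments are in seconds; the two timestamps share the same clock.
    """
    idle_time = now - last_inference_end
    return idle_time >= keep_alive


# With keep_alive = 30, an instance idle for 10 seconds is kept,
# and one idle for 30 seconds or more becomes eligible for release.
kept = not should_release(last_inference_end=100.0, now=110.0, keep_alive=30)
released = should_release(last_inference_end=100.0, now=130.0, keep_alive=30)
```

Under the same reading, keep_alive set to 0 means an instance is eligible for release as soon as inference ends.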

sllm-cli delete

Delete deployed models by name.

Usage
sllm-cli delete [MODELS]
Arguments
  • MODELS
    • Space-separated list of model names to delete.
Example
sllm-cli delete facebook/opt-1.3b facebook/opt-2.7b meta/llama2

sllm-cli generate

Generate outputs using the deployed model.

Usage
sllm-cli generate [OPTIONS] <input_path>
Options
  • -t, --threads <num_threads>
    • Number of parallel generation processes. Default is 1.
Arguments
  • input_path
    • Path to the JSON input file.
Example
sllm-cli generate --threads 4 /path/to/request.json
Example Request File (request.json)
{
  "model": "facebook/opt-1.3b",
  "messages": [
    {
      "role": "user",
      "content": "Please introduce yourself."
    }
  ],
  "temperature": 0.3,
  "max_tokens": 50
}

sllm-cli encode (embedding)

Get embeddings from a deployed model.

Usage
sllm-cli encode [OPTIONS] <input_path>
Options
  • -t, --threads <num_threads>
    • Number of parallel encoding processes. Default is 1.
Arguments
  • input_path
    • Path to the JSON input file.
Example
sllm-cli encode --threads 4 /path/to/request.json
Example Request File (request.json)
{
  "model": "intfloat/e5-mistral-7b-instruct",
  "task_instruct": "Given a question, retrieve passages that answer the question",
  "query": [
    "Hi, How are you?"
  ]
}
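
A request file with this shape can be generated from a list of input strings. The sketch below writes the three fields shown above. The helper write_encode_request is hypothetical. Because "query" is a list in the example, the sketch assumes that it may hold more than one string, which this document does not state explicitly.

```python
import json
from pathlib import Path


def write_encode_request(path, model: str, task_instruct: str, queries) -> dict:
    """Write an encode request file with the fields used in the example above."""
    queries = list(queries)
    if not queries:
        raise ValueError("at least one query string is required")
    request = {
        "model": model,
        "task_instruct": task_instruct,
        "query": queries,  # assumption: multiple strings are accepted
    }
    Path(path).write_text(json.dumps(request, indent=2))
    return request
```

After writing the file, pass its path to sllm-cli encode as shown in the Example above.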

sllm-cli replay

Replay requests based on workload and dataset.

Usage
sllm-cli replay [OPTIONS]
Options
  • --workload <workload_path>

    • Path to the JSON workload file.
  • --dataset <dataset_path>

    • Path to the JSON dataset file.
  • --output <output_path>

    • Path to the output JSON file for latency results. Default is latency_results.json.
Example
sllm-cli replay --workload /path/to/workload.json --dataset /path/to/dataset.json --output /path/to/output.json

sllm-cli update

Update a deployed model using a configuration file or model name.

Usage
sllm-cli update [OPTIONS]
Options
  • --model <model_name>

    • Model name to update with default configuration.
  • --config <config_path>

    • Path to the JSON configuration file.
Example
sllm-cli update --model facebook/opt-1.3b
sllm-cli update --config /path/to/config.json