Installation with ROCm (Experimental)
Latest Tested Version
- v0.5.1
Tested Hardware
- OS: Ubuntu 22.04
- ROCm: 6.2
- PyTorch: 2.3.0
- GPU: MI100s (gfx908), MI200s (gfx90a)
Build the wheel from source and install
ServerlessLLM Store (sllm-store) currently provides experimental support for the ROCm platform. Due to an internal bug in ROCm, sllm-store may leak GPU memory on ROCm versions before 6.2.0, as noted in this issue.
Currently, pip install . does not work with ROCm, so we suggest building the sllm-store wheel and installing it manually in your environment.
To build sllm-store from source, we suggest using Docker and building inside a ROCm container.
- Clone the repository and enter the sllm_store directory:
git clone https://github.com/ServerlessLLM/ServerlessLLM.git
cd ServerlessLLM/sllm_store
- Build the Docker image from Dockerfile.rocm. The Dockerfile.rocm is built on top of the rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 image.
docker build -t sllm_store_rocm -f Dockerfile.rocm .
- Build the package inside the ROCm docker container
docker run -it --rm -v $(pwd)/dist:/app/dist sllm_store_rocm /bin/bash
rm -rf /app/dist/* # remove the existing built files
python setup.py sdist bdist_wheel
- Install PyTorch and the package in your local environment
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
pip install dist/*.whl
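Once the wheel is installed, a quick sanity check from Python confirms that the package imports and that the ROCm build of PyTorch sees your GPU. This is a minimal sketch; it assumes the wheel installs a package importable as sllm_store (the __version__ lookup is guarded in case the attribute is not present):
import torch
import sllm_store  # the package installed from the wheel built above

# ROCm builds of PyTorch expose the GPU through the usual torch.cuda namespace (HIP backend).
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
print("sllm_store version:", getattr(sllm_store, "__version__", "unknown"))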
Verify the Installation
End to end tests
Transformer model Loading and Inference
- Save the facebook/opt-1.3b model in the ./models directory:
python3 examples/sllm_store/save_transformers_model.py --model_name facebook/opt-1.3b --storage_path ./models
- Start the sllm-store server:
sllm-store-server
- Load the model and run inference in another terminal:
python3 examples/sllm_store/load_transformers_model.py --model_name facebook/opt-1.3b --storage_path ./models
Expected Output:
DEBUG 10-31 10:43:14 transformers.py:178] load_dict_non_blocking takes 0.008747100830078125 seconds
DEBUG 10-31 10:43:14 transformers.py:189] load config takes 0.0016036033630371094 seconds
DEBUG 10-31 10:43:14 torch.py:137] allocate_cuda_memory takes 0.0041697025299072266 seconds
DEBUG 10-31 10:43:14 client.py:72] load_into_gpu: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 client.py:113] Model loaded: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 torch.py:160] restore state_dict takes 0.0017423629760742188 seconds
DEBUG 10-31 10:43:14 transformers.py:199] load model takes 0.17534756660461426 seconds
INFO 10-31 10:43:14 client.py:117] confirm_model_loaded: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 client.py:125] Model loaded
Model loading time: 0.20s
~/miniconda3/envs/sllm/lib/python3.10/site-packages/transformers/generation/utils.py:1249: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Hello, my dog is cute and I want to give him a good home. I have a
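The two example scripts above are thin wrappers around the sllm-store Python API. If you want to call the store from your own code, the flow looks roughly like the following sketch. It assumes sllm_store.transformers exposes the save_model and load_model helpers used by the example scripts and that the sllm-store server is running; treat the exact signatures as illustrative rather than authoritative:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sllm_store.transformers import save_model, load_model

# One-off: download the model from Hugging Face and write it into ./models
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)
save_model(model, "./models/facebook/opt-1.3b")

# Load: stream the saved checkpoint from the sllm-store server straight into GPU memory
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))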
vLLM model Loading and Inference
Installing vLLM v0.5.0.post1 directly may not work with ROCm 6.2.0. This is due to an ambiguous function call in ROCm 6.2.0. You may modify vLLM's source code as shown in this commit.
As with CUDA, you need to apply our patch sllm_store/vllm_patch/sllm_load.patch to the installed vLLM library:
./sllm_store/vllm_patch/patch.sh
- Save the facebook/opt-1.3b model in the ./models directory:
python3 examples/sllm_store/save_vllm_model.py --model_name facebook/opt-1.3b --storage_path ./models
- Start the sllm-store server:
sllm-store-server
- Load the model and run inference in another terminal:
python3 examples/sllm_store/load_vllm_model.py --model_name facebook/opt-1.3b --storage_path ./models
Expected Output:
INFO 10-31 11:05:16 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='./models/facebook/opt-1.3b', speculative_config=None, tokenizer='./models/facebook/opt-1.3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.SERVERLESS_LLM, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./models/facebook/opt-1.3b)
INFO 10-31 11:05:17 selector.py:56] Using ROCmFlashAttention backend.
INFO 10-31 11:05:17 selector.py:56] Using ROCmFlashAttention backend.
DEBUG 10-31 11:05:17 torch.py:137] allocate_cuda_memory takes 0.0005428791046142578 seconds
DEBUG 10-31 11:05:17 client.py:72] load_into_gpu: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 client.py:113] Model loaded: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 torch.py:160] restore state_dict takes 0.0013034343719482422 seconds
INFO 10-31 11:05:17 client.py:117] confirm_model_loaded: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 client.py:125] Model loaded
INFO 10-31 11:05:17 model_runner.py:160] Loading model weights took 0.0000 GB
INFO 10-31 11:05:25 gpu_executor.py:83] # GPU blocks: 18509, # CPU blocks: 1365
INFO 10-31 11:05:26 model_runner.py:903] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-31 11:05:26 model_runner.py:907] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-31 11:05:31 model_runner.py:979] Graph capturing finished in 6 secs.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 12.13it/s, est. speed input: 78.83 toks/s, output: 194.04 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, and I have been working as a web designer/developer for the'
Prompt: 'The president of the United States is', Generated text: " speaking in an increasingly important national security forum and he's not using the right words"
Prompt: 'The capital of France is', Generated text: " Paris.\nYeah but you couldn't get it through a French newspaper!"
Prompt: 'The future of AI is', Generated text: ' literally in your hands\nDespite all the hype, AI isn’t here'
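For reference, load_vllm_model.py is essentially a standard vLLM run using the load format that the patch registers. A minimal sketch, assuming the patch is applied, the model was saved with save_vllm_model.py, and the sllm-store server is running (the serverless_llm load format name matches the LoadFormat.SERVERLESS_LLM shown in the log above):
from vllm import LLM, SamplingParams

# With the patch applied, vLLM pulls the weights from the sllm-store server
# instead of reading safetensors from disk.
llm = LLM(
    model="./models/facebook/opt-1.3b",
    load_format="serverless_llm",
    dtype="float16",
)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=16))
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)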
Python tests
- Install the test dependencies
cd ServerlessLLM
pip install -r requirements-test.txt
- Run the tests
cd ServerlessLLM/sllm_store/tests/python
pytest
C++ tests
- Build the C++ tests
cd ServerlessLLM/sllm_store
bash build.sh
- Run the tests
cd ServerlessLLM/sllm_store/build
ctest --output-on-failure
Known issues
- GPU memory leak in ROCm before version 6.2.0
This is due to an internal bug in ROCm. After the inference instance completes, GPU memory remains occupied and is not released. For more information, please refer to this issue.
- vLLM v0.5.0.post1 cannot be built with ROCm 6.2.0
This is due to an ambiguous function call in ROCm 6.2.0. You may modify vLLM's source code as shown in this commit.