LeapfrogAI vLLM Backend

A LeapfrogAI API-compatible vLLM wrapper for quantized and unquantized model inference on GPU infrastructure.

Usage

Pre-Requisites

See the LeapfrogAI documentation website for system requirements and dependencies.

Dependent Components

Model Selection

The officially released images of this backend ship with a 4-bit quantization of the Synthia-7b model as the default model.

All of the commands in this sub-section are executed within this packages/vllm sub-directory.

Optionally, you can specify a different model during Zarf package creation:

uds zarf package create --confirm --set MODEL_REPO_ID=defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g --set MODEL_REVISION=main

If you use a different model, you will likely need to change the generation and engine runtime configurations. See the Zarf Package Config and the values override file for the runtime parameters that can be modified. These parameters are model-specific; their values can be found in the HuggingFace model card and/or the model's configuration files (e.g., prompt templates).

For example, to override a Zarf Package Config default at deployment time:

uds zarf package deploy zarf-package-vllm-amd64-dev.tar.zst --confirm --set ENFORCE_EAGER=True

Deployment

To build and deploy the vllm backend Zarf package into an existing UDS Kubernetes cluster:

Important

Execute the following commands from the root of the LeapfrogAI repository

pip install 'huggingface_hub[cli,hf_transfer]'  # Used to download the model weights from huggingface
make build-vllm LOCAL_VERSION=dev
uds zarf package deploy packages/vllm/zarf-package-vllm-*-dev.tar.zst --confirm
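The glob in the deploy command matches whichever architecture was built, so it can also match several archives if older builds are still in the directory. As a minimal sketch (the helper name `resolve_vllm_package` is hypothetical and not part of the LeapfrogAI repository), the following POSIX shell function insists on exactly one match before you hand the path to `uds zarf package deploy`:

```shell
# Hypothetical helper, not part of the LeapfrogAI repository.
# Prints the single package archive for the given version, or fails
# when zero or several archives match.
resolve_vllm_package() {
  dir="$1"
  version="$2"
  count=0
  match=""
  for f in "$dir"/zarf-package-vllm-*-"$version".tar.zst; do
    # An unmatched glob stays literal, so only count paths that exist.
    if [ -e "$f" ]; then
      count=$((count + 1))
      match="$f"
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 package for version '$version' in $dir, found $count" >&2
    return 1
  fi
  printf '%s\n' "$match"
}

# Usage, from the root of the LeapfrogAI repository:
#   pkg="$(resolve_vllm_package packages/vllm dev)" && \
#     uds zarf package deploy "$pkg" --confirm
```

Failing early on an ambiguous match avoids deploying a stale archive by accident.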

Local Development

For local development, config.yaml and .env.example must be modified if you use a model other than the default. The LeapfrogAI SDK picks up config.yaml automatically, while the .env file must be sourced into the environment that runs the Python process.

Important

Execute the following commands from this sub-directory

Create a .env file based on the .env.example:

cp .env.example .env
source .env
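One detail worth checking: a plain `source` only sets shell variables, and a child process such as Python sees only exported ones. If your .env holds plain `KEY=value` lines without `export` (an assumption; inspect .env.example to confirm its format), a minimal sketch like the following exports everything the file assigns. The function name `load_env` is hypothetical:

```shell
# Hypothetical helper, not part of the LeapfrogAI repository.
# Assumes the file holds plain KEY=value lines without `export`.
load_env() {
  set -a      # auto-export every variable assigned from here on
  . "$1"      # read the file into the current shell
  set +a      # stop auto-exporting
}

# Usage, from this sub-directory (the ./ keeps POSIX shells from
# searching PATH for the file):
#   load_env ./.env
```

If the file already uses `export KEY=value` lines, the plain `source .env` is sufficient and this helper is unnecessary.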

As necessary, modify the existing config.yaml:

vim config.yaml

To run the vllm backend locally:

# Install dev and runtime dependencies
make install

# Download the model weights
python src/model_download.py

# Start the model backend
make dev

Local Docker Container

To run the Docker container, use the following Makefile commands. LOCAL_VERSION must have the same value in both Make commands.

In the root of the LeapfrogAI repository:

LOCAL_VERSION=dev make sdk-wheel

In the root of this vLLM sub-directory:

LOCAL_VERSION=dev make docker
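Because the two commands run from different directories, it is easy to pass different LOCAL_VERSION values by mistake. As a rough sketch (the wrapper name `build_vllm_image` is hypothetical, and it assumes the sub-directory lives at packages/vllm under the repository root), both targets can be driven from one function that takes the version once:

```shell
# Hypothetical wrapper, not part of the LeapfrogAI repository.
# Runs both Make targets with a single LOCAL_VERSION value so the
# two steps cannot drift apart.
build_vllm_image() {
  repo_root="$1"
  version="${2:-dev}"
  # Step 1: build the SDK wheel from the repository root.
  (cd "$repo_root" && LOCAL_VERSION="$version" make sdk-wheel) || return 1
  # Step 2: build the Docker image from the vLLM sub-directory.
  (cd "$repo_root/packages/vllm" && LOCAL_VERSION="$version" make docker) || return 1
}

# Usage, from the root of the LeapfrogAI repository:
#   build_vllm_image . dev
```

Each step runs in a subshell, so the caller's working directory is left unchanged.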