Getting Started with GeeeekExplorer/nano-vllm: Installation, Configuration, and Running Nano-VLLM
Getting Started Fast: Prerequisites, Repository Setup, and Quick Install
To begin quickly, ensure you have the following prerequisites:
- Python: Version 3.9+ (64-bit)
- Git: Installed and accessible.
- Operating System: Linux or macOS is preferred. Windows users should use WSL for best compatibility.
- GPU/CPU: NVIDIA drivers and the CUDA toolkit are required for GPU acceleration. For CPU-only inference, install the CPU build of PyTorch.
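If you want to sanity-check these prerequisites before installing anything, a short Python snippet like the one below can help; it only checks the interpreter version and whether `git` and `nvidia-smi` are on your PATH, and is a convenience sketch rather than part of nano-vllm itself.

```python
import shutil
import sys

# This guide assumes Python 3.9+ (64-bit).
print(f"Python {sys.version.split()[0]} ->",
      "OK" if sys.version_info >= (3, 9) else "upgrade required")

# git must be installed and accessible.
print("git found:", shutil.which("git") is not None)

# A visible nvidia-smi usually means NVIDIA drivers are installed (GPU setups only).
print("nvidia-smi found:", shutil.which("nvidia-smi") is not None)
```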
Repository Setup and Virtual Environment
Setting up the repository and a dedicated virtual environment is straightforward. Here’s the minimal setup to start working with nano-vllm:
| Step | Command / Description |
|---|---|
| Clone the repository | `git clone https://github.com/GeeeekExplorer/nano-vllm.git` then `cd nano-vllm`. Fetches the official repository and moves you into the project directory. |
| Set up the virtual environment | `python -m venv venv`, then `source venv/bin/activate` (Linux/macOS) or `venv\Scripts\activate` (Windows). Creates an isolated Python environment and activates it for immediate use. |
Install Dependencies and PyTorch
Install the project’s dependencies and PyTorch:
- Upgrade pip: `python -m pip install --upgrade pip`
- Install requirements: `pip install -r requirements.txt`
- Install PyTorch:
  - CPU-only: `pip install torch torchvision torchaudio`
  - CUDA (e.g., cu118): `pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118`
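After installing, it is worth confirming that PyTorch imports cleanly and whether it can see a GPU. This is a quick sanity check, not a nano-vllm command:

```python
import torch

# Report the installed PyTorch build and whether CUDA is usable.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```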
Weights Download and Verification
Fetching model weights should be a fast and trustworthy process. Use the included script to download and then confirm integrity.
| Step | Command | Notes |
|---|---|---|
| Download 7B weights | `bash scripts/download_weights.sh 7B` | Downloads weights.bin to the current directory. |
| Verify checksum | `sha256sum weights.bin` | Compare the output to the known value provided in the release notes. A match confirms file integrity. |
Once the checksum verifies successfully, you can proceed to load weights.bin into your environment.
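If `sha256sum` is not available on your system, the same check can be done in Python with `hashlib`. The expected digest below is a placeholder that you would copy from the release notes:

```python
import hashlib

# Placeholder: replace with the official digest published in the release notes.
EXPECTED_SHA256 = "<digest-from-release-notes>"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large weight files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of("weights.bin")
print("computed:", actual)
print("match:", actual == EXPECTED_SHA256)
```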
Minimal Configuration
Configuration is kept simple with a single YAML file that defines server and model parameters. This makes it easy to manage and version alongside your code.
Save the following as config.yaml:
```yaml
server:
  host: 0.0.0.0
  port: 8000
model:
  dir: ./models/7B
  quantization: 4bit
  device: auto
```
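As a quick check that the file parses the way you expect, you can load it with PyYAML (install it with `pip install pyyaml` if needed). This is just a local validation sketch, not how nano-vllm itself reads the file:

```python
import yaml  # pip install pyyaml

# Load and inspect the config to confirm it matches the field guide below.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("server:", cfg["server"]["host"], cfg["server"]["port"])
print("model dir:", cfg["model"]["dir"])
print("quantization:", cfg["model"]["quantization"], "| device:", cfg["model"]["device"])
```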
Configuration Field Guide
| Section | Field | Description | Example |
|---|---|---|---|
| server | host | Address to bind the HTTP server to. | 0.0.0.0 |
| server | port | Port the server listens on. | 8000 |
| model | dir | Filesystem path to the model weights. | ./models/7B |
| model | quantization | Quantization scheme to load the model with. | 4bit |
| model | device | Compute device hint (e.g., auto, cpu, cuda). | auto |
Starting the Nano-VLLM Server
Launch the server quickly and monitor its startup to ensure it’s ready to handle requests.
Launch the server:
python -m nano_vllm.serve --config config.yaml
Monitor startup logs: Look for readiness indicators and the endpoint URL. A typical endpoint to test is http://0.0.0.0:8000/v1/generate.
| What to look for | What it means |
|---|---|
| Readiness line | Server is up and ready to handle requests. |
| Endpoint URL | Base URL for generation requests; use /v1/generate. |
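If you would rather script the readiness check than watch logs, a small polling loop works. The sketch below assumes the /v1/generate endpoint described above and uses the requests library (`pip install requests`):

```python
import time
import requests  # pip install requests

URL = "http://localhost:8000/v1/generate"  # endpoint reported in the startup logs

# Poll the server with a tiny request until it answers or we give up.
for attempt in range(30):
    try:
        r = requests.post(URL, json={"prompt": "ping", "max_tokens": 1}, timeout=5)
        if r.ok:
            print("server is ready:", r.status_code)
            break
    except requests.RequestException:
        pass
    time.sleep(2)
else:
    print("server did not become ready in time")
```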
Running and Benchmarking Nano-VLLM: Demos, API, and Nightly Benchmarks
Demo Run and API Usage
See the API in action by sending an HTTP request to generate text from the model.
Query via HTTP API (local run):
curl -X POST http://localhost:8000/v1/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello world", "max_tokens": 64}'
Ensure your local server is running at the specified address before executing this command.
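The same request can be issued from Python with the requests library, which is convenient once you start scripting experiments. The payload fields simply mirror the curl example above:

```python
import requests  # pip install requests

# Equivalent of the curl call above: POST a JSON payload to the local server.
response = requests.post(
    "http://localhost:8000/v1/generate",
    json={"prompt": "Hello world", "max_tokens": 64},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```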
What to tune and why
Adjusting core parameters allows you to observe how the model’s behavior and latency change. Here’s a guide to the main parameters:
| Parameter | What it controls | Impact on output | Latency impact | Example values |
|---|---|---|---|---|
| max_tokens | Length of generated text (in tokens). | Longer outputs are more likely to be informative or verbose. | Increases roughly with the number of tokens generated. | 64, 128, 256 |
| temperature | Creativity/randomness of sampling. | Lower values produce more deterministic text; higher values add variety. | Typically small but can vary with token choices. | 0.2, 0.7, 1.0 |
| top_p | Nucleus sampling threshold. Controls how much of the probability mass is considered. | Lower means more focused outputs. | Generally minor, but can vary with output length and token choices. | 0.8, 0.95, 1.0 |
Simple experiments you can run:
- Start baseline: `max_tokens = 64`, `temperature = 0.7`, `top_p = 0.95`
- Increase length: Use `max_tokens = 128` and observe longer responses.
- Shift creativity: Set `temperature = 0.2` for more deterministic output.
- Narrow focus: Use `top_p = 0.8` to see more concentrated results.
Example variations (same prompt, different payloads):
curl -X POST http://localhost:8000/v1/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello world", "max_tokens": 128, "temperature": 0.2, "top_p": 0.95}'
curl -X POST http://localhost:8000/v1/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello world", "max_tokens": 32, "temperature": 0.9, "top_p": 0.8}'
Tip: Increasing max_tokens often increases generation time. When benchmarking, establish a baseline and compare relative changes when tweaking temperature and top_p to observe the trade-offs between output length, style, and latency.
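To make those comparisons systematic, a small sweep script can time each payload against the same prompt. Treat this as a sketch of the workflow; it assumes only the /v1/generate endpoint and the parameters shown above, and makes no assumptions about the response schema beyond it being JSON:

```python
import time
import requests  # pip install requests

URL = "http://localhost:8000/v1/generate"
PROMPT = "Hello world"

# Payloads to compare: baseline, longer output, lower temperature, tighter nucleus.
settings = [
    {"max_tokens": 64,  "temperature": 0.7, "top_p": 0.95},
    {"max_tokens": 128, "temperature": 0.7, "top_p": 0.95},
    {"max_tokens": 64,  "temperature": 0.2, "top_p": 0.95},
    {"max_tokens": 64,  "temperature": 0.7, "top_p": 0.8},
]

for params in settings:
    start = time.perf_counter()
    r = requests.post(URL, json={"prompt": PROMPT, **params}, timeout=120)
    elapsed = time.perf_counter() - start
    print(f"{params} -> {elapsed:.2f}s, status {r.status_code}")
```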
Benchmarking and Nightly Results
Nightly benchmarks provide an up-to-date performance check for vLLM. Each major update includes fresh measurements, so you can track performance shifts over time.
See the vLLM performance dashboard for the latest results. Nightly benchmarks compare vLLM’s performance against alternatives like TGI, TRT-LLM, and LMDeploy during major updates.
Common metrics in nightly results:
| Metric | What it indicates |
|---|---|
| Latency (ms/token) | Average speed at which the model generates each token. |
| Throughput (tokens/sec) | Number of tokens that can be produced per second under load. |
| Memory usage | Peak memory footprint during inference. |
Open-Source vLLM Serving Landscape: A Straightforward Comparison
Here’s a comparison of Nano-VLLM with other popular open-source vLLM serving solutions:
| Item | Core Strengths | Model Support | Performance Characteristics | Setup & Dependencies | Ideal Use Case |
|---|---|---|---|---|---|
| Nano-VLLM (GeeeekExplorer) | Lightweight; minimal dependencies; quick start | Supports 7B models with 4-bit quantization | Low footprint; rapid deployment for prototyping | Very lightweight; minimal setup and configuration | Rapid prototyping; edge deployments |
| TGI | Broad model support and feature set | Broad model support | Heavier runtime; more setup complexity | Higher setup complexity; heavier runtime environment | Use when broad coverage and features are needed, accepting higher complexity |
| TRT-LLM | TensorRT-accelerated backend | Optimized for NVIDIA GPUs | Best latency on NVIDIA GPUs | Higher setup investment; hardware-specific (NVIDIA GPUs) | Low-latency inference on NVIDIA GPUs |
| LMDeploy | Flexible, multi-backend serving framework | Multi-backend support | Balanced setup complexity and deployment versatility | Moderate setup complexity; versatile deployment options | Deployment versatility across backends |
Pros and Cons of Getting Started with Nano-VLLM
Pros
- Very fast to get a local demo up and running.
- Low memory footprint with 4-bit quantization.
- Minimal dependencies.
- A straightforward CLI.
Cons
- Might lack some advanced features found in heavier stacks.
- Model availability can depend on legally obtainable weights.
- Tooling and community examples are still maturing.
