
vLLM Acceleration

Under Development

This feature is currently under active development. It may not be included in the stable release version or may have limited functionality.

vLLM is a high-throughput and memory-efficient LLM serving engine. While llama.cpp focuses on broad compatibility and low-resource environments, vLLM is designed for production-grade performance on high-end GPUs.

What is vLLM?

Traditional serving stacks keep each request's KV cache in a single contiguous memory region, which leads to significant fragmentation and over-reservation (up to 60-80% of KV-cache memory can be wasted). vLLM introduces PagedAttention, an algorithm inspired by virtual memory paging in operating systems: the KV cache is split into fixed-size blocks that can live in non-contiguous GPU memory and are tracked through a per-sequence block table.
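
To build intuition, here is a simplified toy sketch of the paging idea, not vLLM's actual code: each sequence owns a block table that maps its logical KV-cache blocks to whichever physical blocks happen to be free, so the cache never needs a contiguous region. Block and pool sizes are made up.

```python
# Toy sketch of the paging idea behind PagedAttention.
# NOT vLLM's implementation; block and pool sizes are illustrative.

BLOCK_SIZE = 16            # tokens stored per KV-cache block
NUM_PHYSICAL_BLOCKS = 8    # size of a tiny physical block pool

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))   # unused physical block ids
block_tables: dict[int, list[int]] = {}          # seq id -> physical block ids
token_counts: dict[int, int] = {}                # seq id -> tokens cached

def append_token(seq_id: int) -> None:
    """Reserve KV-cache space for one more token of a sequence."""
    table = block_tables.setdefault(seq_id, [])
    count = token_counts.get(seq_id, 0)
    if count == len(table) * BLOCK_SIZE:          # last block is full
        table.append(free_blocks.pop())           # grab any free physical block
    token_counts[seq_id] = count + 1

# Two sequences grow independently; their physical blocks need not be adjacent.
for _ in range(20):
    append_token(seq_id=0)
append_token(seq_id=1)
print(block_tables)   # e.g. {0: [7, 6], 1: [5]}
```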

Key Benefits

  • Higher Throughput: Can handle many more concurrent requests than standard Hugging Face Transformers.

  • Efficient Memory: Near-zero KV-cache waste in GPU VRAM, allowing for larger batch sizes or longer context windows.

  • State-of-the-Art: Often among the first engines to support cutting-edge features such as continuous batching and quantization formats (AWQ, GPTQ, SqueezeLLM).

Role in Backend.AI GO

In Backend.AI GO, vLLM serves as the "Pro" engine, primarily for users with NVIDIA GPUs (on Linux or Windows WSL).

Note: vLLM support in Backend.AI GO is currently in Beta.

When to use vLLM?

You should consider switching to the vLLM backend if:

  1. You have a powerful NVIDIA GPU: vLLM shines on hardware such as the RTX 5090 (32GB), the professional RTX Pro 6000 (96GB), or a DGX Spark (GB10, 128GB) personal AI workstation.

  2. Concurrency Matters: You are using Backend.AI GO as a server for multiple users or agents simultaneously.

  3. Throughput over Latency: You need to process massive amounts of text (e.g., summarizing hundreds of documents) where total completion time is more important than the "time to first token" (see the sketch below).
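
For illustration, here is a minimal offline-batching sketch using vLLM's Python `LLM` API; the model id and prompts are placeholders, and exact arguments may differ between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Throughput-oriented usage: hand vLLM a large batch of prompts at once
# and let continuous batching keep the GPU saturated.
documents = [f"Placeholder text for document {i} ..." for i in range(100)]
prompts = [f"Summarize the following document:\n\n{doc}\n\nSummary:" for doc in documents]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model id
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(prompts, params)               # one call, many requests
for out in outputs:
    print(out.outputs[0].text.strip()[:80])
```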

Configuration

To use vLLM in Backend.AI GO:

  1. Go to Settings > Inference.

  2. Change the default backend for CUDA from llama.cpp to vLLM.

  3. Ensure you have the appropriate NVIDIA drivers and CUDA toolkit installed (usually version 11.8 or higher).
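
If you want to confirm the GPU stack before switching backends, a quick check like the following (assuming PyTorch is installed, which vLLM requires) reports the detected device, the CUDA runtime bundled with PyTorch, and available VRAM.

```python
import torch

# Quick sanity check before enabling the vLLM backend.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("CUDA runtime (PyTorch build):", torch.version.cuda)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected; stay on the llama.cpp backend.")
```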

Supported Model Formats

Unlike llama.cpp, which uses the .gguf format, vLLM works directly with:

  • Hugging Face format: Standard .safetensors or .bin weights.

  • AWQ / GPTQ: Pre-quantized models for faster inference.
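
As a hedged example, loading either format with the offline `LLM` API only takes a model id; the repository names below are illustrative, and the `quantization` argument can usually be omitted because vLLM detects the method from the model's config.

```python
from vllm import LLM

# Standard Hugging Face weights (.safetensors) -- example repository id.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Pre-quantized AWQ checkpoint -- example repository id. In practice you
# would load one model or the other, not both in the same process.
llm_awq = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
```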


vLLM brings data-center class inference performance directly to your high-end workstation.