
Local Model Acceleration

Backend.AI GO achieves high performance by using several specialized inference engines. This section explains how these engines work and which one is best suited to your hardware.

The Inference Stack

When you load a model, Backend.AI GO starts a "sidecar" process—a background server dedicated to running that specific model. This separation keeps the main application stable even if an inference engine misbehaves, and it lets us pair different technologies with different hardware.
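
The sketch below is only an illustration of the sidecar pattern, not Backend.AI GO's actual launcher: it starts llama.cpp's llama-server binary as a background process and queries it over its local OpenAI-compatible HTTP API. The model path, port, and wait time are placeholder assumptions.

    # Illustrative sidecar sketch (not the app's real launcher).
    # Assumes llama.cpp's `llama-server` binary is on PATH and a .gguf model exists.
    import json
    import subprocess
    import time
    import urllib.request

    MODEL_PATH = "models/example.Q4_K_M.gguf"  # placeholder path
    PORT = 8080

    # Start the dedicated inference server as a background (sidecar) process.
    sidecar = subprocess.Popen(["llama-server", "-m", MODEL_PATH, "--port", str(PORT)])
    time.sleep(5)  # naive wait; a real launcher would poll the server's health endpoint

    # Talk to the sidecar over its local OpenAI-compatible HTTP API.
    payload = json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    }).encode()
    req = urllib.request.Request(
        f"http://127.0.0.1:{PORT}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])

    sidecar.terminate()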

Supported Engines

llama.cpp (Cross-Platform)

llama.cpp is the core of our cross-platform support. It is highly optimized for CPU inference and also supports GPU acceleration on NVIDIA (CUDA), AMD (ROCm), and Intel GPUs.

  • Format: .gguf
  • Best for: Windows, Linux, and Intel-based Macs.
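
As a minimal sketch of what this engine does under the hood (Backend.AI GO manages this for you), the example below uses the llama-cpp-python bindings to load a .gguf file with GPU offload; the model path is a placeholder assumption.

    # Minimal llama.cpp sketch via the llama-cpp-python bindings (illustrative only).
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/example.Q4_K_M.gguf",  # any .gguf file
        n_gpu_layers=-1,   # offload all layers to the GPU if one is available
        n_ctx=4096,        # context window size
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}]
    )
    print(out["choices"][0]["message"]["content"])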

MLX (macOS Native)

An array framework for machine learning on Apple Silicon, developed by Apple's machine learning research team. It provides the best performance and memory efficiency on M1/M2/M3/M4 chips.

  • Format: MLX-compatible folders (usually downloaded from Hugging Face).
  • Best for: Apple Silicon Macs.
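
For illustration, a minimal sketch using the mlx-lm package to load and run an MLX-format model; the Hugging Face repository name is an example assumption, not a requirement.

    # Minimal MLX sketch using the mlx-lm package (illustrative only).
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    text = generate(
        model,
        tokenizer,
        prompt="Explain Apple Silicon's unified memory in one sentence.",
        max_tokens=64,
    )
    print(text)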

vLLM (Advanced / Planned)

A high-throughput serving engine optimized for enterprise-grade GPUs; support in Backend.AI GO is planned. It uses PagedAttention, which manages the attention key/value cache in fixed-size blocks, to serve many requests simultaneously.
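
Since vLLM integration is still planned, the sketch below shows the vLLM library used directly rather than through Backend.AI GO; the model name is an example assumption.

    # Direct vLLM usage sketch (not yet wired into Backend.AI GO).
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=64)

    # PagedAttention lets vLLM batch many prompts efficiently in one call.
    prompts = ["What is PagedAttention?", "Why batch requests?"]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)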


Auto-Detection

Backend.AI GO automatically detects your hardware (CPU, GPU, RAM) and selects the most appropriate settings. You can monitor this in real time using the System Metrics dashboard at the bottom of the app.
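
To make the idea concrete, here is a hypothetical sketch of how an engine might be chosen from basic hardware checks; it is not the app's actual detection logic.

    # Hypothetical engine-selection sketch (not Backend.AI GO's real logic).
    import platform
    import shutil

    def pick_engine() -> str:
        # Apple Silicon Macs report arm64 on Darwin, where MLX is the native choice.
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            return "mlx"
        # An NVIDIA driver on PATH suggests CUDA-accelerated llama.cpp.
        if shutil.which("nvidia-smi"):
            return "llama.cpp (CUDA)"
        # Otherwise fall back to CPU inference with llama.cpp.
        return "llama.cpp (CPU)"

    print(pick_engine())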