
Local Model Acceleration

Backend.AI GO achieves high performance by using several specialized inference engines. This section explains how these engines work and which one is best suited to your hardware.

The Inference Stack

When you load a model, Backend.AI GO starts a "sidecar" process—a background server dedicated to running that specific model. This separation keeps the main application stable even if an inference engine misbehaves, and it lets us pair different technologies with different hardware.
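
The sketch below is only an illustration of the sidecar pattern, not Backend.AI GO's actual launcher: it starts llama.cpp's llama-server binary as a background process and queries it over its local OpenAI-compatible HTTP API. The model path, port, and wait time are placeholder assumptions.

    # Illustrative sidecar sketch (not the app's real launcher).
    # Assumes llama.cpp's `llama-server` binary is on PATH and a .gguf model exists.
    import json
    import subprocess
    import time
    import urllib.request

    MODEL_PATH = "models/example.Q4_K_M.gguf"  # placeholder path
    PORT = 8080

    # Start the dedicated inference server as a background (sidecar) process.
    sidecar = subprocess.Popen(["llama-server", "-m", MODEL_PATH, "--port", str(PORT)])
    time.sleep(5)  # naive wait; a real launcher would poll the server's health endpoint

    # Talk to the sidecar over its local OpenAI-compatible HTTP API.
    payload = json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    }).encode()
    req = urllib.request.Request(
        f"http://127.0.0.1:{PORT}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])

    sidecar.terminate()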

Supported Engines

llama.cpp (Cross-Platform)

llama.cpp is the core of our cross-platform support. It is highly optimized for CPU inference and also supports GPU acceleration on NVIDIA (CUDA), AMD (ROCm), and Intel GPUs.

  • Format: .gguf
  • Best for: Windows, Linux, and Intel-based Macs.
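
As a minimal sketch of what this engine does under the hood (Backend.AI GO manages this for you), the example below uses the llama-cpp-python bindings to load a .gguf file with GPU offload; the model path is a placeholder assumption.

    # Minimal llama.cpp sketch via the llama-cpp-python bindings (illustrative only).
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/example.Q4_K_M.gguf",  # any .gguf file
        n_gpu_layers=-1,   # offload all layers to the GPU if one is available
        n_ctx=4096,        # context window size
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}]
    )
    print(out["choices"][0]["message"]["content"])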

MLX (macOS Native)

An array framework for machine learning on Apple Silicon, developed by Apple's machine learning research team. It provides the best performance and memory efficiency on M1/M2/M3/M4 chips.

  • Format: MLX-compatible folders (usually downloaded from Hugging Face).
  • Best for: Apple Silicon Macs.
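
For illustration, a minimal sketch using the mlx-lm package to load and run an MLX-format model; the Hugging Face repository name is an example assumption, not a requirement.

    # Minimal MLX sketch using the mlx-lm package (illustrative only).
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    text = generate(
        model,
        tokenizer,
        prompt="Explain Apple Silicon's unified memory in one sentence.",
        max_tokens=64,
    )
    print(text)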

vLLM (Advanced / Planned)

A high-throughput serving engine optimized for enterprise-grade GPUs; support in Backend.AI GO is planned. It uses PagedAttention, which manages the attention key/value cache in fixed-size blocks, to serve many requests simultaneously.
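
Since vLLM integration is still planned, the sketch below shows the vLLM library used directly rather than through Backend.AI GO; the model name is an example assumption.

    # Direct vLLM usage sketch (not yet wired into Backend.AI GO).
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=64)

    # PagedAttention lets vLLM batch many prompts efficiently in one call.
    prompts = ["What is PagedAttention?", "Why batch requests?"]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)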


Auto-Detection

Backend.AI GO automatically detects your hardware (CPU, GPU, RAM) and selects the most appropriate settings. You can monitor this in real time using the System Metrics dashboard at the bottom of the app.
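
To make the idea concrete, here is a hypothetical sketch of how an engine might be chosen from basic hardware checks; it is not the app's actual detection logic.

    # Hypothetical engine-selection sketch (not Backend.AI GO's real logic).
    import platform
    import shutil

    def pick_engine() -> str:
        # Apple Silicon Macs report arm64 on Darwin, where MLX is the native choice.
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            return "mlx"
        # An NVIDIA driver on PATH suggests CUDA-accelerated llama.cpp.
        if shutil.which("nvidia-smi"):
            return "llama.cpp (CUDA)"
        # Otherwise fall back to CPU inference with llama.cpp.
        return "llama.cpp (CPU)"

    print(pick_engine())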