4.2. Engine & Runtime Management

The Engines menu is your central hub for managing inference engines and their runtime dependencies. Here you can download, update, and configure different versions of llama.cpp, MLX, and other inference backends optimized for your specific hardware.

Understanding the Engines Page

The Engines page is divided into three main sections:

  1. Installed Engines - Engines currently available on your system
  2. Installed Runtimes - Runtime libraries (CUDA, ROCm, etc.) that engines depend on
  3. Available Engines - New engines you can download from the official registry

(Screenshot: Engine management page)

Why Manage Engines?

Hardware-Specific Optimization

Different GPUs require different engine builds:

| Hardware | Optimal Engine Variant |
| --- | --- |
| NVIDIA RTX/GeForce | llama.cpp with CUDA 13 |
| NVIDIA (older) | llama.cpp with CUDA 12 |
| AMD Radeon/Instinct | llama.cpp with ROCm/HIP |
| Intel Arc | llama.cpp with SYCL or Vulkan |
| Apple Silicon | llama.cpp with Metal, MLX, or MLXcel |
| CPU-only | llama.cpp CPU (AVX2/AVX-512) |

Supported Inference Engines

Backend.AI GO supports multiple inference engines:

| Engine | Supported Formats | Platform | Description |
| --- | --- | --- | --- |
| llama.cpp | GGUF | All | High-performance GGUF model inference |
| MLX LM | Safetensors, GGUF | macOS (Apple Silicon) | Apple MLX-based inference |
| MLXcel | Safetensors | macOS (Apple Silicon) | Lablup's optimized MLX serving engine |

Format-Specific Default Engines

When multiple engines support the same format (e.g., MLX LM and MLXcel both support Safetensors), you can configure which engine to use by default in Settings > Inference > Default engines by format.

Version Control

Keep multiple engine versions installed side-by-side:

  • Test new releases before committing to them
  • Roll back if a new version has issues
  • Compare performance between versions

Dependency Management

Some engines require runtime libraries (like CUDA or ROCm). The Engines page automatically detects and manages these dependencies, installing them when needed.


Installing an Engine

From the Registry

  1. Navigate to the Engines page from the sidebar menu.

  2. Scroll to the Available Engines section.

  3. Find the engine you want (e.g., llama.cpp).

  4. Click the Download button.

  5. If multiple variants are available (e.g., CUDA 13, Metal, CPU), a dialog will appear:

    • Recommended variants are marked based on your detected hardware
    • Each variant shows its download size
    • Select the variant that matches your GPU
  6. The download begins immediately. Watch the progress in the floating Download Queue panel.

Download Progress Stages

The installation process goes through several stages:

| Stage | Description |
| --- | --- |
| Downloading | Fetching the engine package from the registry |
| Extracting | Unpacking the compressed archive |
| Verifying | Checking file integrity via checksums |
| Installing | Copying files to the final location |
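
The verification stage is conceptually a checksum comparison. Here is a minimal sketch of how such a check works (illustrative only; Backend.AI GO's actual implementation is internal, and the use of SHA-256 and the `verify_package` helper are assumptions):

```python
import hashlib
from pathlib import Path

def verify_package(path: Path, expected_sha256: str) -> bool:
    """Compare a package file's SHA-256 digest against the expected value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large engine packages aren't loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

If the computed digest doesn't match, the package is treated as corrupted and must be re-downloaded.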

Runtime Dependencies

If an engine requires a runtime library (like CUDA) that isn't installed, it will be downloaded automatically. You'll see a separate progress indicator for runtime downloads.

Offline Installation

For air-gapped environments or when you prefer manual downloads:

  1. Download .baiengine package files from Backend.AI GO Releases.

  2. Place them in the platform-specific incoming directory:

    macOS/Linux: ~/.backend-ai-go/engines/incoming/

    Windows: %APPDATA%\backend.ai-go\engines\incoming\

  3. Open the Engines page. A Pending Packages banner will appear.

  4. Click Import to install the detected packages.

Alternatively, you can drag and drop .baiengine files directly onto the Engines page.
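
The pending-package detection in step 3 can be sketched as a simple directory scan (an illustrative sketch, not Backend.AI GO's actual code; the incoming directory path comes from this page, but the function name and the newest-first ordering are assumptions):

```python
from pathlib import Path

def find_pending_packages(incoming_dir: Path) -> list[Path]:
    """Return staged .baiengine packages, newest first."""
    if not incoming_dir.is_dir():
        return []
    packages = [p for p in incoming_dir.glob("*.baiengine") if p.is_file()]
    # Sort by modification time so the most recently staged package appears first.
    return sorted(packages, key=lambda p: p.stat().st_mtime, reverse=True)
```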


Managing Installed Engines

Engine Cards

Each installed engine is displayed as a card showing:

  • Engine name and version (e.g., llama.cpp 1.0.0)
  • Accelerator badge (Metal, CUDA 13, CPU, etc.)
  • Status badges:
    • 🟢 Active - Currently running a model
    • 🟠 Update Available - Newer version in registry
  • Supported formats (GGUF, MLX, etc.)
  • Installation size

Actions

| Action | Description |
| --- | --- |
| Star icon | Mark as the default engine for this engine type |
| Refresh icon | Update to the latest version (when available) |
| Trash icon | Uninstall the engine |
| Card click | Open the details drawer |

Setting a Default Engine

When you have multiple variants of the same engine (e.g., both CUDA 13 and CPU versions of llama.cpp), you can set one as the default:

  1. Click the star icon on the engine card.
  2. The star becomes filled, indicating this is now the default.
  3. When loading models, Backend.AI GO will prefer this engine variant.

Multiple Defaults

You can have one default per base engine. For example, you might have a default llama.cpp (CUDA 13) and a default MLX engine simultaneously.


Engine Details Drawer

Click on any engine card to open the details drawer with three tabs:

Overview Tab

  • Basic Information: ID, version, accelerator, installation date
  • Manifest Details: Format version, upstream version, platform compatibility
  • Accelerator Information: Specific backend details and optimizations

Files Tab

Browse the installed files:

  • Directory structure
  • File sizes
  • Useful for troubleshooting or verifying installation

Dependencies Tab

View runtime dependencies:

  • Required vs. optional dependencies
  • Installation status of each dependency
  • Version requirements

Runtime Libraries

What are Runtimes?

Runtime libraries are shared dependencies that engines need to function. The most common are:

| Runtime | Purpose |
| --- | --- |
| CUDA 13 Runtime | NVIDIA GPU acceleration (modern GPUs) |
| CUDA 12 Runtime | NVIDIA GPU acceleration (older GPUs) |
| ROCm/HIP Runtime | AMD GPU acceleration |
| oneAPI/SYCL Runtime | Intel GPU acceleration |

Automatic Installation

When you install an engine that requires a runtime:

  1. Backend.AI GO detects the missing dependency.
  2. The runtime is downloaded automatically.
  3. Both progress indicators appear in the download queue.
  4. The runtime is installed before the engine finishes.

Viewing Installed Runtimes

The Installed Runtimes section shows:

  • Runtime name and version
  • Installation date
  • Which engines depend on this runtime

Click a runtime card to see the full list of dependent engines.

Runtime Persistence

Runtimes are shared across engine versions. If you:

  • Update an engine: The runtime remains intact
  • Uninstall all engines using a runtime: The runtime stays (for future use)
  • Manually delete a runtime: Dependent engines may stop working

Hardware Detection

Backend.AI GO automatically detects your system hardware to recommend the best engine variants.

What's Detected

  • GPU Vendor: NVIDIA, AMD, Intel, or Apple
  • GPU Model: Specific card name (e.g., RTX 4090, RX 7900 XTX)
  • Driver Version: CUDA version, ROCm version, etc.
  • VRAM: Available video memory
  • Disk Space: Available storage for engine installation
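
As a rough model of how detected hardware maps to a recommended variant (illustrative only; the real detection inspects drivers and devices directly, and the table and function here are assumptions mirroring the hardware table earlier on this page):

```python
# Vendor-to-variant preference order (first available wins).
RECOMMENDED_VARIANTS = {
    "nvidia": ["cuda", "vulkan", "cpu"],
    "amd": ["rocm", "hip", "vulkan", "cpu"],
    "intel": ["sycl", "vulkan", "cpu"],
    "apple": ["metal", "cpu"],
}

def recommend_variant(vendor: str, available_variants: list[str]) -> str:
    """Return the best available accelerator variant for a detected GPU vendor."""
    for variant in RECOMMENDED_VARIANTS.get(vendor.lower(), ["cpu"]):
        if variant in available_variants:
            return variant
    return "cpu"  # CPU builds work everywhere, as noted under Best Practices
```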

Viewing System Capabilities

The system capabilities are shown when selecting engine variants:

  • Recommended badge on the best variant for your hardware
  • Available accelerators listed in the install dialog
  • Disk space warnings if storage is low

Updating Engines

Checking for Updates

Backend.AI GO periodically checks the registry for new engine versions. When an update is available:

  • An orange Update Available badge appears on the engine card
  • The version number of the update is displayed

Applying Updates

  1. Click the refresh icon on the engine card.
  2. The new version downloads and replaces the old one.
  3. Your settings and default preferences are preserved.

Active Engines

You cannot update an engine while it's running a model. Stop the model first, then update.


Troubleshooting

Engine Won't Install

| Symptom | Solution |
| --- | --- |
| Download fails | Check internet connection; try again later |
| Extraction fails | Ensure sufficient disk space |
| Verification fails | Package may be corrupted; re-download |
| Missing runtime | Runtime download may have failed; check manually |

Engine Won't Start

| Symptom | Solution |
| --- | --- |
| "Library not found" | Runtime dependency missing; reinstall engine |
| "GPU not detected" | Update GPU drivers; try CPU variant |
| Crashes immediately | Check system requirements; try different variant |

Clearing Stuck Downloads

If a download appears stuck:

  1. Click the Cancel button in the download queue.
  2. Wait for cleanup to complete.
  3. Try the installation again.

Cannot Cancel During Verification

The cancel button is disabled during the verification stage to prevent file corruption.


Best Practices

Choose the Right Variant

  • For maximum performance: Match the engine variant to your GPU
  • For compatibility: CPU variants work everywhere but are slower
  • For memory-constrained systems: Some variants are more memory-efficient

Keep Engines Updated

New versions often include:

  • Performance improvements
  • Bug fixes
  • Support for new model architectures
  • Security patches

Manage Disk Space

Engine packages can be large (100 MB to over 1 GB). Periodically:

  • Remove unused engine variants
  • Check the Files tab to see installation sizes
  • Keep only the variants you actively use

Integration with Model Loading

The Engines page works closely with the model loading system:

  1. Model Format Detection: When you load a model, Backend.AI GO checks its format (GGUF, Safetensors, etc.).

  2. Engine Resolution: The app finds an installed engine that supports that format.

  3. Format-Specific Defaults: If you've configured a default engine for that format in Settings, it is used.

  4. Priority-Based Selection: If no default is set, engines are selected based on priority:

    • For Safetensors format: MLXcel > MLX LM
    • For GGUF format: llama.cpp (with GPU acceleration preferred)
  5. Hardware Optimization: Within the same priority level, GPU-accelerated variants are preferred over CPU-only.
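
The resolution steps above can be sketched roughly as follows (a simplified model, not Backend.AI GO's actual code; the data shapes and engine identifiers are assumptions):

```python
# Priority tables derived from the rules above (earlier entries win).
FORMAT_PRIORITY = {
    "safetensors": ["mlxcel", "mlx-lm"],
    "gguf": ["llama.cpp"],
}

def resolve_engine(model_format, installed, format_defaults):
    """Pick an installed engine for a model format.

    installed: list of dicts like {"name": "llama.cpp", "gpu": True}
    format_defaults: user-configured {format: engine name} from Settings
    """
    fmt = model_format.lower()
    # 1. A user-configured default for this format wins outright.
    default = format_defaults.get(fmt)
    if default:
        matches = [e for e in installed if e["name"] == default]
        if matches:
            return matches[0]
    # 2. Otherwise walk the priority list for the format.
    for name in FORMAT_PRIORITY.get(fmt, []):
        candidates = [e for e in installed if e["name"] == name]
        if candidates:
            # Within the same priority, prefer GPU-accelerated variants.
            return max(candidates, key=lambda e: e["gpu"])
    return None
```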

Setting Format Defaults

Go to Settings > Inference > Default engines by format to configure which engine should be used for each model format. This is especially useful when you have both MLXcel and MLX LM installed and want to choose which one handles Safetensors models.

For more details on model loading, see:


Technical Details

Engine Package Format

Engine packages use the .baiengine format:

  • ZIP archive containing binaries and metadata
  • manifest.json describes the engine and its requirements
  • Checksums ensure file integrity
  • Platform-specific builds for each OS/architecture
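
A manifest.json might look roughly like this (an illustrative sketch; the field names and values are assumptions, not the documented schema):

```json
{
  "format_version": "1",
  "id": "llama-cpp-cuda13",
  "name": "llama.cpp",
  "version": "1.0.0",
  "platform": "windows-x86_64",
  "accelerator": "cuda",
  "formats": ["gguf"],
  "dependencies": [
    {"id": "cuda13-runtime", "required": true}
  ],
  "checksums": {"bin/llama-server.exe": "sha256:..."}
}
```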

Directory Structure

~/.backend-ai-go/
├── engines/
│   ├── incoming/              # Offline package staging
│   ├── installed.json         # Installed engines registry
│   ├── llama-cpp-metal/       # Engine: llama.cpp with Metal
│   ├── llama-cpp-cuda13/      # Engine: llama.cpp with CUDA 13
│   └── mlx/                   # Engine: MLX
├── runtimes/
│   ├── installed.json         # Installed runtimes registry
│   ├── cuda13-runtime/        # Runtime: CUDA 13
│   └── hip-runtime/           # Runtime: AMD HIP
├── logs/                      # Application logs
└── config/                    # Configuration files

Supported Accelerators

| Accelerator | Platform | Description |
| --- | --- | --- |
| metal | macOS | Apple Metal for M-series chips |
| cuda | Windows, Linux | NVIDIA CUDA |
| rocm | Linux | AMD ROCm |
| hip | Windows | AMD HIP (Windows port of ROCm) |
| vulkan | All | Cross-platform GPU API |
| sycl | All | Intel oneAPI SYCL |
| cpu | All | Fallback CPU-only inference |

The Engines system ensures you always have the optimal inference backend for your hardware, making local AI both accessible and performant.