4.2. Engine & Runtime Management¶
The Engines menu is your central hub for managing inference engines and their runtime dependencies. Here you can download, update, and configure different versions of llama.cpp, MLX, and other inference backends optimized for your specific hardware.
Understanding the Engines Page¶
The Engines page is divided into three main sections:
- Installed Engines - Engines currently available on your system
- Installed Runtimes - Runtime libraries (CUDA, ROCm, etc.) that engines depend on
- Available Engines - New engines you can download from the official registry

Why Manage Engines?¶
Hardware-Specific Optimization¶
Different GPUs require different engine builds:
| Hardware | Optimal Engine Variant |
|---|---|
| NVIDIA RTX/GeForce | llama.cpp with CUDA 13 |
| NVIDIA (older) | llama.cpp with CUDA 12 |
| AMD Radeon/Instinct | llama.cpp with ROCm/HIP |
| Intel Arc | llama.cpp with SYCL or Vulkan |
| Apple Silicon | llama.cpp with Metal, MLX, or MLXcel |
| CPU-only | llama.cpp CPU (AVX2/AVX-512) |
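The mapping in the table above can be sketched as a small selection helper. This is an illustrative sketch only; the function and variant names are hypothetical and not Backend.AI GO's actual API.

```python
# Hypothetical sketch: map a detected GPU vendor to a recommended engine
# variant, mirroring the hardware table above. All names are illustrative.

def recommend_variant(vendor: str, generation: str = "modern") -> str:
    """Return a llama.cpp variant name for the detected GPU vendor."""
    if vendor == "nvidia":
        # Newer RTX/GeForce cards use CUDA 13; older cards fall back to CUDA 12.
        return "llama.cpp-cuda13" if generation == "modern" else "llama.cpp-cuda12"
    if vendor == "amd":
        return "llama.cpp-rocm"   # ROCm/HIP builds for Radeon/Instinct
    if vendor == "intel":
        return "llama.cpp-sycl"   # Vulkan is an alternative on Intel Arc
    if vendor == "apple":
        return "llama.cpp-metal"  # MLX LM or MLXcel are also options here
    return "llama.cpp-cpu"        # AVX2/AVX-512 CPU fallback

print(recommend_variant("nvidia"))  # llama.cpp-cuda13
```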
Supported Inference Engines¶
Backend.AI GO supports multiple inference engines:
| Engine | Supported Formats | Platform | Description |
|---|---|---|---|
| llama.cpp | GGUF | All | High-performance GGUF model inference |
| MLX LM | Safetensors, GGUF | macOS (Apple Silicon) | Apple MLX-based inference |
| MLXcel | Safetensors | macOS (Apple Silicon) | Lablup's optimized MLX serving engine |
Format-Specific Default Engines
When multiple engines support the same format (e.g., MLX LM and MLXcel both support Safetensors), you can configure which engine to use by default in Settings > Inference > Default engines by format.
Version Control¶
Keep multiple engine versions installed side-by-side:
- Test new releases before committing
- Roll back if a new version has issues
- Compare performance between versions
Dependency Management¶
Some engines require runtime libraries (like CUDA or ROCm). The Engines page automatically detects and manages these dependencies, installing them when needed.
Installing an Engine¶
From the Registry¶
1. Navigate to the Engines page from the sidebar menu.
2. Scroll to the Available Engines section.
3. Find the engine you want (e.g., llama.cpp).
4. Click the Download button.
5. If multiple variants are available (e.g., CUDA 13, Metal, CPU), a dialog will appear:
   - Recommended variants are marked based on your detected hardware
   - Each variant shows its download size
   - Select the variant that matches your GPU
6. The download begins immediately. Watch the progress in the floating Download Queue panel.
Download Progress Stages¶
The installation process goes through several stages:
| Stage | Description |
|---|---|
| Downloading | Fetching the engine package from the registry |
| Extracting | Unpacking the compressed archive |
| Verifying | Checking file integrity via checksums |
| Installing | Copying files to the final location |
Runtime Dependencies
If an engine requires a runtime library (like CUDA) that isn't installed, it will be downloaded automatically. You'll see a separate progress indicator for runtime downloads.
Offline Installation¶
For air-gapped environments or when you prefer manual downloads:
1. Download .baiengine package files from Backend.AI GO Releases.
2. Place them in the incoming directory (~/.backend-ai-go/engines/incoming/).
3. Open the Engines page. A Pending Packages banner will appear.
4. Click Import to install the detected packages.

Alternatively, you can drag and drop .baiengine files directly onto the Engines page.
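The Pending Packages detection can be sketched as a simple directory scan. This is a hypothetical illustration; the path follows the layout documented under Directory Structure, but the function itself is not the app's code.

```python
from pathlib import Path

# Default staging directory, per the layout in the Directory Structure section.
INCOMING = Path.home() / ".backend-ai-go" / "engines" / "incoming"

def pending_packages(incoming: Path = INCOMING) -> list[Path]:
    """Return staged .baiengine packages awaiting import, sorted by name."""
    if not incoming.is_dir():
        return []
    return sorted(incoming.glob("*.baiengine"))
```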
Managing Installed Engines¶
Engine Cards¶
Each installed engine is displayed as a card showing:
- Engine name and version (e.g., llama.cpp 1.0.0)
- Accelerator badge (Metal, CUDA 13, CPU, etc.)
- Status badges:
- 🟢 Active - Currently running a model
- 🟠 Update Available - Newer version in registry
- Supported formats (GGUF, MLX, etc.)
- Installation size
Actions¶
| Action | Description |
|---|---|
| Star icon | Mark this variant as the default for its engine type |
| Refresh icon | Update to the latest version (when available) |
| Trash icon | Uninstall the engine |
| Card click | Open the details drawer |
Setting a Default Engine¶
When you have multiple variants of the same engine (e.g., both CUDA 13 and CPU versions of llama.cpp), you can set one as the default:
- Click the star icon on the engine card.
- The star becomes filled, indicating this is now the default.
- When loading models, Backend.AI GO will prefer this engine variant.
Multiple Defaults
You can have one default per base engine. For example, you might have a default llama.cpp (CUDA 13) and a default MLX engine simultaneously.
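The one-default-per-base-engine rule can be pictured as a mapping keyed by engine family. A minimal sketch with illustrative engine names:

```python
# Sketch: defaults are keyed by base engine, so one starred variant per
# engine family can coexist. Engine and variant names are illustrative.
defaults: dict[str, str] = {}

def star(base_engine: str, variant: str) -> None:
    """Make `variant` the default for its base engine, replacing any previous star."""
    defaults[base_engine] = variant

star("llama.cpp", "llama.cpp-cuda13")
star("mlx", "mlx-lm")
star("llama.cpp", "llama.cpp-cpu")  # re-starring replaces the old default
print(defaults)  # {'llama.cpp': 'llama.cpp-cpu', 'mlx': 'mlx-lm'}
```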
Engine Details Drawer¶
Click on any engine card to open the details drawer with three tabs:
Overview Tab¶
- Basic Information: ID, version, accelerator, installation date
- Manifest Details: Format version, upstream version, platform compatibility
- Accelerator Information: Specific backend details and optimizations
Files Tab¶
Browse the installed files:
- Directory structure
- File sizes
- Useful for troubleshooting or verifying installation
Dependencies Tab¶
View runtime dependencies:
- Required vs. optional dependencies
- Installation status of each dependency
- Version requirements
Runtime Libraries¶
What are Runtimes?¶
Runtime libraries are shared dependencies that engines need to function. The most common are:
| Runtime | Purpose |
|---|---|
| CUDA 13 Runtime | NVIDIA GPU acceleration (modern GPUs) |
| CUDA 12 Runtime | NVIDIA GPU acceleration (older GPUs) |
| ROCm/HIP Runtime | AMD GPU acceleration |
| oneAPI/SYCL Runtime | Intel GPU acceleration |
Automatic Installation¶
When you install an engine that requires a runtime:
- Backend.AI GO detects the missing dependency.
- The runtime is downloaded automatically.
- Both progress indicators appear in the download queue.
- The runtime is installed before the engine finishes.
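The ordering above, with runtimes completing before the engine, can be sketched as follows. Package names are illustrative and the function is not the app's actual resolver:

```python
# Sketch of dependency-aware ordering: missing runtimes are queued first,
# and the engine itself installs last.

def install_order(engine: str, requires: list[str], installed: set[str]) -> list[str]:
    """Return the packages to install: missing runtimes first, the engine last."""
    missing = [r for r in requires if r not in installed]
    return missing + [engine]

# llama-cpp-cuda13 needs the CUDA 13 runtime, which is not yet installed:
print(install_order("llama-cpp-cuda13", ["cuda13-runtime"], set()))
# ['cuda13-runtime', 'llama-cpp-cuda13']
```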
Viewing Installed Runtimes¶
The Installed Runtimes section shows:
- Runtime name and version
- Installation date
- Which engines depend on this runtime
Click a runtime card to see the full list of dependent engines.
Runtime Persistence¶
Runtimes are shared across engine versions. If you:
- Update an engine: The runtime remains intact
- Uninstall all engines using a runtime: The runtime stays (for future use)
- Manually delete a runtime: Dependent engines may stop working
Hardware Detection¶
Backend.AI GO automatically detects your system hardware to recommend the best engine variants.
What's Detected¶
- GPU Vendor: NVIDIA, AMD, Intel, or Apple
- GPU Model: Specific card name (e.g., RTX 4090, RX 7900 XTX)
- Driver Version: CUDA version, ROCm version, etc.
- VRAM: Available video memory
- Disk Space: Available storage for engine installation
Viewing System Capabilities¶
The system capabilities are shown when selecting engine variants:
- Recommended badge on the best variant for your hardware
- Available accelerators listed in the install dialog
- Disk space warnings if storage is low
Updating Engines¶
Checking for Updates¶
Backend.AI GO periodically checks the registry for new engine versions. When an update is available:
- An orange Update Available badge appears on the engine card
- The version number of the update is displayed
Applying Updates¶
- Click the refresh icon on the engine card.
- The new version downloads and replaces the old one.
- Your settings and default preferences are preserved.
Active Engines
You cannot update an engine while it's running a model. Stop the model first, then update.
Troubleshooting¶
Engine Won't Install¶
| Symptom | Solution |
|---|---|
| Download fails | Check internet connection; try again later |
| Extraction fails | Ensure sufficient disk space |
| Verification fails | Package may be corrupted; re-download |
| Missing runtime | Runtime download may have failed; check manually |
Engine Won't Start¶
| Symptom | Solution |
|---|---|
| "Library not found" | Runtime dependency missing; reinstall engine |
| "GPU not detected" | Update GPU drivers; try CPU variant |
| Crashes immediately | Check system requirements; try different variant |
Clearing Stuck Downloads¶
If a download appears stuck:
- Click the Cancel button in the download queue.
- Wait for cleanup to complete.
- Try the installation again.
Cannot Cancel During Verification
The cancel button is disabled during the verification stage to prevent file corruption.
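The cancellation rule above can be sketched as a guard on the task's stage. A hypothetical illustration, not the app's implementation:

```python
# Sketch: a cancel request is honored in every stage except Verifying,
# where interrupting a checksum pass could leave corrupt files behind.

class DownloadTask:
    def __init__(self) -> None:
        self.stage = "Downloading"
        self.cancelled = False

    def cancel(self) -> bool:
        """Attempt to cancel; returns False while verification is in progress."""
        if self.stage == "Verifying":
            return False
        self.cancelled = True
        return True
```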
Best Practices¶
Choose the Right Variant¶
- For maximum performance: Match the engine variant to your GPU
- For compatibility: CPU variants work everywhere but are slower
- For memory-constrained systems: Some variants are more memory-efficient
Keep Engines Updated¶
New versions often include:
- Performance improvements
- Bug fixes
- Support for new model architectures
- Security patches
Manage Disk Space¶
Engine packages can be large (100 MB to over 1 GB). Periodically:
- Remove unused engine variants
- Check the Files tab to see installation sizes
- Keep only the variants you actively use
Integration with Model Loading¶
The Engines page works closely with the model loading system:
1. Model Format Detection: When you load a model, Backend.AI GO checks its format (GGUF, Safetensors, etc.).
2. Engine Resolution: The app finds an installed engine that supports that format.
3. Format-Specific Defaults: If you have configured a default engine for the format in Settings, it is used.
4. Priority-Based Selection: If no default is set, engines are selected based on priority:
   - For Safetensors format: MLXcel > MLX LM
   - For GGUF format: llama.cpp (with GPU acceleration preferred)
5. Hardware Optimization: Within the same priority level, GPU-accelerated variants are preferred over CPU-only ones.
Setting Format Defaults
Go to Settings > Inference > Default engines by format to configure which engine should be used for each model format. This is especially useful when you have both MLXcel and MLX LM installed and want to choose which one handles Safetensors models.
For more details on model loading, see:
Technical Details¶
Engine Package Format¶
Engine packages use the .baiengine format:
- ZIP archive containing binaries and metadata
- manifest.json describes the engine and its requirements
- Checksums ensure file integrity
- Platform-specific builds for each OS/architecture
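The verification step can be sketched against this format. This assumes a manifest with a per-file SHA-256 checksum map, which is an illustrative guess at the schema, not the documented one:

```python
# Sketch: open a .baiengine archive (a ZIP per the list above), read its
# manifest.json, and verify each checksummed file. The "checksums" key
# and SHA-256 choice are assumptions for illustration.
import hashlib
import json
import zipfile

def verify_package(path: str) -> dict:
    """Check every checksummed file in a .baiengine archive; return its manifest."""
    with zipfile.ZipFile(path) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        for name, expected in manifest.get("checksums", {}).items():
            digest = hashlib.sha256(zf.read(name)).hexdigest()
            if digest != expected:
                raise ValueError(f"checksum mismatch for {name}")
    return manifest
```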
Directory Structure¶
```
~/.backend-ai-go/
├── engines/
│   ├── incoming/          # Offline package staging
│   ├── installed.json     # Installed engines registry
│   ├── llama-cpp-metal/   # Engine: llama.cpp with Metal
│   ├── llama-cpp-cuda13/  # Engine: llama.cpp with CUDA 13
│   └── mlx/               # Engine: MLX
├── runtimes/
│   ├── installed.json     # Installed runtimes registry
│   ├── cuda13-runtime/    # Runtime: CUDA 13
│   └── hip-runtime/       # Runtime: AMD HIP
├── logs/                  # Application logs
└── config/                # Configuration files
```
Supported Accelerators¶
| Accelerator | Platform | Description |
|---|---|---|
| metal | macOS | Apple Metal for M-series chips |
| cuda | Windows, Linux | NVIDIA CUDA |
| rocm | Linux | AMD ROCm |
| hip | Windows | AMD HIP (Windows port of ROCm) |
| vulkan | All | Cross-platform GPU API |
| sycl | All | Intel oneAPI SYCL |
| cpu | All | Fallback CPU-only inference |
The Engines system ensures you always have the optimal inference backend for your hardware, making local AI both accessible and performant.