4.2. Engine & Runtime Management

The Engines menu is your central hub for managing inference engines and their runtime dependencies. Here you can download, update, and configure different versions of llama.cpp, MLX, and other inference backends optimized for your specific hardware.

Understanding the Engines Page

The Engines page is divided into three main sections:

  1. Installed Engines - Engines currently available on your system
  2. Installed Runtimes - Runtime libraries (CUDA, ROCm, etc.) that engines depend on
  3. Available Engines - New engines you can download from the official registry

(Screenshot: Engine management page)

Why Manage Engines?

Hardware-Specific Optimization

Different GPUs require different engine builds:

| Hardware | Optimal Engine Variant |
| --- | --- |
| NVIDIA RTX/GeForce | llama.cpp with CUDA 13 |
| NVIDIA (older) | llama.cpp with CUDA 12 |
| AMD Radeon/Instinct | llama.cpp with ROCm/HIP |
| Intel Arc | llama.cpp with SYCL or Vulkan |
| Apple Silicon | llama.cpp with Metal, MLX, or MLXcel |
| CPU-only | llama.cpp CPU (AVX2/AVX-512) |

Supported Inference Engines

Backend.AI GO supports multiple inference engines:

| Engine | Supported Formats | Platform | Description |
| --- | --- | --- | --- |
| llama.cpp | GGUF | All | High-performance GGUF model inference |
| MLX LM | Safetensors, GGUF | macOS (Apple Silicon) | Apple MLX-based inference |
| MLXcel | Safetensors | macOS (Apple Silicon) | Lablup's optimized MLX serving engine |

Format-Specific Default Engines

When multiple engines support the same format (e.g., MLX LM and MLXcel both support Safetensors), you can configure which engine to use by default in Settings > Inference > Default engines by format.

Version Control

Keep multiple engine versions installed side-by-side:

  • Test new releases before committing to them
  • Roll back if a new version has issues
  • Compare performance between versions

Dependency Management

Some engines require runtime libraries (like CUDA or ROCm). The Engines page automatically detects and manages these dependencies, installing them when needed.


Installing an Engine

From the Registry

  1. Navigate to the Engines page from the sidebar menu.

  2. Scroll to the Available Engines section.

  3. Find the engine you want (e.g., llama.cpp).

  4. Click the Download button.

  5. If multiple variants are available (e.g., CUDA 13, Metal, CPU), a dialog will appear:

    • Recommended variants are marked based on your detected hardware
    • Each variant shows its download size
    • Select the variant that matches your GPU
  6. The download begins immediately. Watch the progress in the floating Download Queue panel.

Download Progress Stages

The installation process goes through several stages:

| Stage | Description |
| --- | --- |
| Downloading | Fetching the engine package from the registry |
| Extracting | Unpacking the compressed archive |
| Verifying | Checking file integrity via checksums |
| Installing | Copying files to the final location |
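
The verification stage is conceptually a checksum comparison. Here is a minimal sketch of how such a check works (illustrative only; Backend.AI GO's actual implementation is internal, and the use of SHA-256 and the `verify_package` helper are assumptions):

```python
import hashlib
from pathlib import Path

def verify_package(path: Path, expected_sha256: str) -> bool:
    """Compare a package file's SHA-256 digest against the expected value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large engine packages aren't loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

If the computed digest doesn't match, the package is treated as corrupted and must be re-downloaded.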

Runtime Dependencies

If an engine requires a runtime library (like CUDA) that isn't installed, it will be downloaded automatically. You'll see a separate progress indicator for runtime downloads.

Offline Installation

For air-gapped environments or when you prefer manual downloads:

  1. Download .baiengine package files from Backend.AI GO Releases.

  2. Place them in the platform-specific incoming directory:

    macOS/Linux: ~/.backend-ai-go/engines/incoming/

    Windows: %APPDATA%\backend.ai-go\engines\incoming\

  3. Open the Engines page. A Pending Packages banner will appear.

  4. Click Import to install the detected packages.

Alternatively, you can drag and drop .baiengine files directly onto the Engines page.
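
The pending-package detection in step 3 can be sketched as a simple directory scan (an illustrative sketch, not Backend.AI GO's actual code; the incoming directory path comes from this page, but the function name and the newest-first ordering are assumptions):

```python
from pathlib import Path

def find_pending_packages(incoming_dir: Path) -> list[Path]:
    """Return staged .baiengine packages, newest first."""
    if not incoming_dir.is_dir():
        return []
    packages = [p for p in incoming_dir.glob("*.baiengine") if p.is_file()]
    # Sort by modification time so the most recently staged package appears first.
    return sorted(packages, key=lambda p: p.stat().st_mtime, reverse=True)
```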


Managing Installed Engines

Engine Cards

Each installed engine is displayed as a card showing:

  • Engine name and version (e.g., llama.cpp 1.0.0)
  • Accelerator badge (Metal, CUDA 13, CPU, etc.)
  • Status badges:
    • 🟢 Active - Currently running a model
    • 🟠 Update Available - Newer version in registry
  • Supported formats (GGUF, MLX, etc.)
  • Installation size

Actions

| Action | Description |
| --- | --- |
| Star icon | Mark as the default engine for this engine type |
| Refresh icon | Update to the latest version (when available) |
| Trash icon | Uninstall the engine |
| Card click | Open the details drawer |

Setting a Default Engine

When you have multiple variants of the same engine (e.g., both CUDA 13 and CPU versions of llama.cpp), you can set one as the default:

  1. Click the star icon on the engine card.
  2. The star becomes filled, indicating this is now the default.
  3. When loading models, Backend.AI GO will prefer this engine variant.

Multiple Defaults

You can have one default per base engine. For example, you might have a default llama.cpp (CUDA 13) and a default MLX engine simultaneously.


Engine Details Drawer

Click on any engine card to open the details drawer with three tabs:

Overview Tab

  • Basic Information: ID, version, accelerator, installation date
  • Manifest Details: Format version, upstream version, platform compatibility
  • Accelerator Information: Specific backend details and optimizations

Files Tab

Browse the installed files:

  • Directory structure
  • File sizes
  • Useful for troubleshooting or verifying installation

Dependencies Tab

View runtime dependencies:

  • Required vs. optional dependencies
  • Installation status of each dependency
  • Version requirements

Runtime Libraries

What are Runtimes?

Runtime libraries are shared dependencies that engines need to function. The most common are:

| Runtime | Purpose |
| --- | --- |
| CUDA 13 Runtime | NVIDIA GPU acceleration (modern GPUs) |
| CUDA 12 Runtime | NVIDIA GPU acceleration (older GPUs) |
| ROCm/HIP Runtime | AMD GPU acceleration |
| oneAPI/SYCL Runtime | Intel GPU acceleration |

Automatic Installation

When you install an engine that requires a runtime:

  1. Backend.AI GO detects the missing dependency.
  2. The runtime is downloaded automatically.
  3. Both progress indicators appear in the download queue.
  4. The runtime is installed before the engine finishes.

Viewing Installed Runtimes

The Installed Runtimes section shows:

  • Runtime name and version
  • Installation date
  • Which engines depend on this runtime

Click a runtime card to see the full list of dependent engines.

Runtime Persistence

Runtimes are shared across engine versions. If you:

  • Update an engine: The runtime remains intact
  • Uninstall all engines using a runtime: The runtime stays (for future use)
  • Manually delete a runtime: Dependent engines may stop working

Hardware Detection

Backend.AI GO automatically detects your system hardware to recommend the best engine variants.

What's Detected

  • GPU Vendor: NVIDIA, AMD, Intel, or Apple
  • GPU Model: Specific card name (e.g., RTX 4090, RX 7900 XTX)
  • Driver Version: CUDA version, ROCm version, etc.
  • VRAM: Available video memory
  • Disk Space: Available storage for engine installation
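
As a rough model of how detected hardware maps to a recommended variant (illustrative only; the real detection inspects drivers and devices directly, and the table and function here are assumptions mirroring the hardware table earlier on this page):

```python
# Vendor-to-variant preference order (first available wins).
RECOMMENDED_VARIANTS = {
    "nvidia": ["cuda", "vulkan", "cpu"],
    "amd": ["rocm", "hip", "vulkan", "cpu"],
    "intel": ["sycl", "vulkan", "cpu"],
    "apple": ["metal", "cpu"],
}

def recommend_variant(vendor: str, available_variants: list[str]) -> str:
    """Return the best available accelerator variant for a detected GPU vendor."""
    for variant in RECOMMENDED_VARIANTS.get(vendor.lower(), ["cpu"]):
        if variant in available_variants:
            return variant
    return "cpu"  # CPU builds work everywhere, as noted under Best Practices
```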

Viewing System Capabilities

The system capabilities are shown when selecting engine variants:

  • Recommended badge on the best variant for your hardware
  • Available accelerators listed in the install dialog
  • Disk space warnings if storage is low

Updating Engines

Checking for Updates

Backend.AI GO periodically checks the registry for new engine versions. When an update is available:

  • An orange Update Available badge appears on the engine card
  • The version number of the update is displayed

Applying Updates

  1. Click the refresh icon on the engine card.
  2. The new version downloads and replaces the old one.
  3. Your settings and default preferences are preserved.

Active Engines

You cannot update an engine while it's running a model. Stop the model first, then update.


Troubleshooting

Engine Won't Install

| Symptom | Solution |
| --- | --- |
| Download fails | Check internet connection; try again later |
| Extraction fails | Ensure sufficient disk space |
| Verification fails | Package may be corrupted; re-download |
| Missing runtime | Runtime download may have failed; check manually |

Engine Won't Start

| Symptom | Solution |
| --- | --- |
| "Library not found" | Runtime dependency missing; reinstall engine |
| "GPU not detected" | Update GPU drivers; try CPU variant |
| Crashes immediately | Check system requirements; try different variant |

Clearing Stuck Downloads

If a download appears stuck:

  1. Click the Cancel button in the download queue.
  2. Wait for cleanup to complete.
  3. Try the installation again.

Cannot Cancel During Verification

The cancel button is disabled during the verification stage to prevent file corruption.


Best Practices

Choose the Right Variant

  • For maximum performance: Match the engine variant to your GPU
  • For compatibility: CPU variants work everywhere but are slower
  • For memory-constrained systems: Some variants are more memory-efficient

Keep Engines Updated

New versions often include:

  • Performance improvements
  • Bug fixes
  • Support for new model architectures
  • Security patches

Manage Disk Space

Engine packages can be large (100 MB to over 1 GB). Periodically:

  • Remove unused engine variants
  • Check the Files tab to see installation sizes
  • Keep only the variants you actively use

Integration with Model Loading

The Engines page works closely with the model loading system:

  1. Model Format Detection: When you load a model, Backend.AI GO checks its format (GGUF, Safetensors, etc.).

  2. Engine Resolution: The app finds an installed engine that supports that format.

  3. Format-Specific Defaults: If you've configured a default engine for that format in Settings, it is used.

  4. Priority-Based Selection: If no default is set, engines are selected based on priority:

    • For Safetensors format: MLXcel > MLX LM
    • For GGUF format: llama.cpp (with GPU acceleration preferred)
  5. Hardware Optimization: Within the same priority level, GPU-accelerated variants are preferred over CPU-only.
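
The resolution steps above can be sketched roughly as follows (a simplified model, not Backend.AI GO's actual code; the data shapes and engine identifiers are assumptions):

```python
# Priority tables derived from the rules above (earlier entries win).
FORMAT_PRIORITY = {
    "safetensors": ["mlxcel", "mlx-lm"],
    "gguf": ["llama.cpp"],
}

def resolve_engine(model_format, installed, format_defaults):
    """Pick an installed engine for a model format.

    installed: list of dicts like {"name": "llama.cpp", "gpu": True}
    format_defaults: user-configured {format: engine name} from Settings
    """
    fmt = model_format.lower()
    # 1. A user-configured default for this format wins outright.
    default = format_defaults.get(fmt)
    if default:
        matches = [e for e in installed if e["name"] == default]
        if matches:
            return matches[0]
    # 2. Otherwise walk the priority list for the format.
    for name in FORMAT_PRIORITY.get(fmt, []):
        candidates = [e for e in installed if e["name"] == name]
        if candidates:
            # Within the same priority, prefer GPU-accelerated variants.
            return max(candidates, key=lambda e: e["gpu"])
    return None
```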

Setting Format Defaults

Go to Settings > Inference > Default engines by format to configure which engine should be used for each model format. This is especially useful when you have both MLXcel and MLX LM installed and want to choose which one handles Safetensors models.

For more details on model loading, see:


Technical Details

Engine Package Format

Engine packages use the .baiengine format:

  • ZIP archive containing binaries and metadata
  • manifest.json describes the engine and its requirements
  • Checksums ensure file integrity
  • Platform-specific builds for each OS/architecture
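
A manifest.json might look roughly like this (an illustrative sketch; the field names and values are assumptions, not the documented schema):

```json
{
  "format_version": "1",
  "id": "llama-cpp-cuda13",
  "name": "llama.cpp",
  "version": "1.0.0",
  "platform": "windows-x86_64",
  "accelerator": "cuda",
  "formats": ["gguf"],
  "dependencies": [
    {"id": "cuda13-runtime", "required": true}
  ],
  "checksums": {"bin/llama-server.exe": "sha256:..."}
}
```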

Directory Structure

~/.backend-ai-go/
├── engines/
│   ├── incoming/              # Offline package staging
│   ├── installed.json         # Installed engines registry
│   ├── llama-cpp-metal/       # Engine: llama.cpp with Metal
│   ├── llama-cpp-cuda13/      # Engine: llama.cpp with CUDA 13
│   └── mlx/                   # Engine: MLX
├── runtimes/
│   ├── installed.json         # Installed runtimes registry
│   ├── cuda13-runtime/        # Runtime: CUDA 13
│   └── hip-runtime/           # Runtime: AMD HIP
├── logs/                      # Application logs
└── config/                    # Configuration files

Supported Accelerators

| Accelerator | Platform | Description |
| --- | --- | --- |
| metal | macOS | Apple Metal for M-series chips |
| cuda | Windows, Linux | NVIDIA CUDA |
| rocm | Linux | AMD ROCm |
| hip | Windows | AMD HIP (Windows port of ROCm) |
| vulkan | All | Cross-platform GPU API |
| sycl | All | Intel oneAPI SYCL |
| cpu | All | Fallback CPU-only inference |

The Engines system ensures you always have the optimal inference backend for your hardware, making local AI both accessible and performant.