10.7. CLI Reference

The aigo CLI tool provides command-line access to the Backend.AI GO Management API. Use this tool to manage local models, control inference servers, monitor system resources, and interact with loaded models from the terminal.

Installation

The CLI is included with the Backend.AI GO distribution. If you are building from source:

cd cli
cargo install --path .

Usage

aigo [OPTIONS] <COMMAND>

Auto-Discovery

When --endpoint is not specified, the CLI automatically discovers a running Backend.AI GO instance by reading a discovery file written by the Management API server at startup. No configuration is required for the most common case of connecting to a locally running instance.

Endpoint resolution order:

  1. --endpoint flag or BACKEND_AI_GO_ENDPOINT environment variable (explicit override)
  2. Config file endpoint (if changed from the default via aigo config set endpoint ...)
  3. Auto-discovery file (if a local instance is running and healthy)
  4. Default fallback: http://127.0.0.1:8001
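For example, to bypass auto-discovery (step 1 of the resolution order) and target a remote instance explicitly — the host address below is a placeholder:

# One-off override via the flag
aigo --endpoint http://192.168.1.50:8001 system health

# Session-wide override via the environment variable
export BACKEND_AI_GO_ENDPOINT=http://192.168.1.50:8001
aigo system health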

Discovery file locations by OS:

  • macOS: ~/Library/Application Support/ai.backend.go/mgmt-api.json
  • Linux: $XDG_RUNTIME_DIR/ai.backend.go/mgmt-api.json (fallback: ~/.config/ai.backend.go/mgmt-api.json)
  • Windows: %APPDATA%\ai.backend.go\mgmt-api.json

Before connecting, the CLI validates the discovery file by checking that the server process (identified by PID) is still running and that the endpoint responds to a health check. Stale files from crashed instances are silently ignored.
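If discovery behaves unexpectedly, you can inspect the file directly. The file is JSON; given the validation steps above it records at least the endpoint and the server PID, but the exact field names shown in the comment below are an assumption, not a guaranteed schema:

# Inspect the discovery file (macOS path shown; see the list above for other OSes)
cat ~/Library/Application\ Support/ai.backend.go/mgmt-api.json

# Illustrative contents (field names are assumed):
# {"endpoint": "http://127.0.0.1:8001", "pid": 12345}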

Global Options

  • --endpoint, -e (env: BACKEND_AI_GO_ENDPOINT): Management API endpoint (URL or configured name). Overrides auto-discovery.
  • --token, -t (env: BACKEND_AI_GO_TOKEN): API authentication token.
  • --output, -o (env: BACKEND_AI_GO_OUTPUT): Output format: console, json, yaml.
  • --quiet, -q: Suppress non-essential output.
  • --verbose, -v: Enable verbose output.
  • --no-verify-ssl: Skip SSL certificate verification.

Commands

chat - One-Shot Chat Completion

Send a single message to a loaded model and print the response.

aigo chat [OPTIONS] [MESSAGE]

If MESSAGE is omitted, input is read from stdin (up to 1 MiB).

Options:

  • --model <MODEL>, -m: Model to use for completion.
  • --max-tokens <INT>: Maximum tokens to generate (default: 1024).
  • --temperature <FLOAT>: Sampling temperature 0.0–2.0 (default: 0.7). Ignored when --reasoning-effort is set.
  • --system <PROMPT>, -s: System prompt to prepend.
  • --reasoning-effort <LEVEL>: Reasoning effort level for hybrid-thinking models. Accepted values: none, low, medium, high, xhigh. Use none to disable thinking mode via chat_template_kwargs.
  • --no-think: Disable thinking mode (sets chat_template_kwargs.enable_thinking=false). Takes precedence over --reasoning-effort.
  • --thinking-budget <N>: Per-request cap on tokens emitted inside the <think> block (sent as thinking_budget_tokens in the request body). -1 = unlimited (engine default), 0 = immediate end (disables thinking), N>0 = hard cap of N tokens. Engine-agnostic: works on both llama-server and mlxcel-server.
  • --preserve-thinking: Retain <think> blocks from all prior assistant turns instead of stripping them (Qwen3.6+ feature). Sets chat_template_kwargs.preserve_thinking=true. Orthogonal to --no-think / --reasoning-effort; both kwargs coexist when the flags are combined. Older Qwen3/3.5 models accept the flag, but behavior is unvalidated.

When --reasoning-effort is set to a level other than none, the request sends both reasoning_effort and chat_template_kwargs: {"enable_thinking": true}. When set to none, or when --no-think is passed, only chat_template_kwargs: {"enable_thinking": false} is sent, which is the correct way to suppress the <think> block on Qwen3/3.5 hybrid-thinking models.

--thinking-budget and --preserve-thinking are independent of --reasoning-effort: the budget caps how many tokens the model can emit inside <think>, and preserve_thinking controls whether prior <think> blocks survive in the prompt. Both fields travel in the per-request HTTP body, so they are forwarded unchanged to llama-server and mlxcel-server (and through the continuum-router passthrough path).
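Putting these fields together, a request built from --reasoning-effort high --thinking-budget 64 --preserve-thinking would carry a body along these lines. This is a sketch assuming the OpenAI-compatible chat-completions shape; only the reasoning_effort, thinking_budget_tokens, and chat_template_kwargs fields are documented here, and the model name is a placeholder:

{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "..."}],
  "reasoning_effort": "high",
  "thinking_budget_tokens": 64,
  "chat_template_kwargs": {
    "enable_thinking": true,
    "preserve_thinking": true
  }
}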

Examples:

# Basic chat
aigo chat "What is the capital of France?"

# Disable thinking mode on a Qwen3 model
aigo chat --no-think "Summarize this document" < report.txt

# Enable thinking with medium effort
aigo chat --reasoning-effort medium "Solve this step by step: ..."

# Cap thinking at 64 tokens (force concise reasoning)
aigo chat --thinking-budget 64 --reasoning-effort high "Quick: 2+2=?"

# Disable thinking via the budget (equivalent to --no-think for engines that implement it)
aigo chat --thinking-budget 0 "Just answer directly."

# Preserve <think> blocks across turns on Qwen3.6+ (improves agent KV cache reuse)
aigo chat --preserve-thinking --reasoning-effort high "Continue solving from where we left off."

# Pipe input with a system prompt
echo "SELECT * FROM users" | aigo chat --system "You are a SQL expert."

complete - One-Shot Text Completion

Send a prompt for text completion (non-chat format).

aigo complete [OPTIONS] [PROMPT]

If PROMPT is omitted, input is read from stdin.

Options:

  • --model <MODEL>, -m: Model to use.
  • --max-tokens <INT>: Maximum tokens to generate (default: 256).
  • --temperature <FLOAT>: Sampling temperature 0.0–2.0 (default: 0.7).
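For example, a raw text completion (no chat template is applied), reusing the model ID from the Examples section; the prompt file name is a placeholder:

# Complete a prompt with a specific model, short output
aigo complete -m "gemma-3n-E4B-it-Q4_K_M" --max-tokens 64 "The quick brown fox"

# Read the prompt from stdin with low-temperature sampling
cat prompt.txt | aigo complete --temperature 0.2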

config - Configuration Management

Manage CLI configuration settings.

  • aigo config path: Show configuration file path.
  • aigo config get <KEY>: Get a configuration value.
  • aigo config set <KEY> <VALUE>: Set a configuration value.
  • aigo config list: List all configuration values.
  • aigo config reset: Reset configuration to defaults.
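For example, to pin the CLI to a fixed endpoint (step 2 of the endpoint resolution order) and verify the change; the URL is a placeholder:

# Persist a custom endpoint, then confirm it
aigo config set endpoint http://127.0.0.1:9001
aigo config get endpoint

# Revert all settings to defaults (re-enables auto-discovery)
aigo config reset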

model - Local Model Management

Manage models stored on the local disk.

  • aigo model list: List all local models.
  • aigo model info <MODEL_ID>: Get detailed information about a specific model.
  • aigo model refresh: Refresh the model index (scan for new files).

loaded - Loaded Model Operations

Control models currently loaded into memory for inference.

  • aigo loaded list: List currently loaded models.
  • aigo loaded info <ID>: Get details of a loaded model instance.
  • aigo loaded load [OPTIONS] <MODEL_ID>: Load a model into memory.
    • Options:
      • -c, --context-length <INT>: Override context length.
      • -g, --gpu-layers <INT>: Number of layers to offload to GPU (-1 for all).
      • -t, --threads <INT>: Number of threads to use.
      • -a, --alias <STRING>: Model alias for routing.
      • --tool-calling: Enable tool calling capabilities.
      • --mmproj <PATH>: Path to mmproj file for vision models.
  • aigo loaded unload <ID>: Unload a model to free resources.
  • aigo loaded health <ID>: Check the health status of a loaded model.
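A typical lifecycle ties these subcommands together: load a model, look up its instance ID, check health, and unload it to free memory. The <ID> placeholder below stands for the instance ID reported by aigo loaded list:

# Load with all layers on GPU and a routing alias
aigo loaded load "gemma-3n-E4B-it-Q4_K_M" --gpu-layers -1 --alias gemma

# Find the instance ID, verify health, then unload when done
aigo loaded list
aigo loaded health <ID>
aigo loaded unload <ID>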

router - Router Control

Manage the Continuum Router service.

  • aigo router status: Get the current status of the router.
  • aigo router start: Start the router service.
  • aigo router stop: Stop the router service.
  • aigo router restart: Restart the router service.
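These subcommands combine with the global output options, which is convenient for scripting against router state:

# Check router state as JSON (machine-readable), then restart if needed
aigo router status -o json
aigo router restart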

system - System Monitoring

Monitor hardware resources and API status.

  • aigo system info: Get general system information (OS, Architecture).
  • aigo system metrics: Get current system metrics (CPU, RAM usage).
  • aigo system gpu: Get detailed GPU information.
  • aigo system health: Check the overall API health.
  • aigo system version: Get the API server version.

Examples

List all available models in JSON format:

aigo model list -o json

Load a model with custom GPU layers:

aigo loaded load "gemma-3n-E4B-it-Q4_K_M" --gpu-layers 33

Check system GPU status:

aigo system gpu