10.7. CLI Reference¶
The aigo CLI tool provides command-line access to the Backend.AI GO Management API. Use this tool to manage local models, control inference servers, monitor system resources, and interact with loaded models from the terminal.
Installation¶
The CLI is included with the Backend.AI GO distribution. If you are building from source:
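A typical Go build might look like the following (a sketch only; the `./cmd/aigo` package path is an assumption about the repository layout, not confirmed by this reference):

```shell
# From the repository root (the ./cmd/aigo path is an assumed layout)
go build -o aigo ./cmd/aigo

# Verify the binary
./aigo --help
```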
Usage¶
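The general invocation shape, inferred from the global options and commands documented below:

```shell
aigo [GLOBAL OPTIONS] <COMMAND> [SUBCOMMAND] [ARGS...]
```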
Auto-Discovery¶
When --endpoint is not specified, the CLI automatically discovers a running Backend.AI GO instance by reading a discovery file written by the Management API server at startup. No configuration is required for the most common case of connecting to a locally running instance.
Endpoint resolution order:
1. `--endpoint` flag or `BACKEND_AI_GO_ENDPOINT` environment variable (explicit override)
2. Config file endpoint (if changed from the default via `aigo config set endpoint ...`)
3. Auto-discovery file (if a local instance is running and healthy)
4. Default fallback: `http://127.0.0.1:8001`
Discovery file locations by OS:
- macOS: `~/Library/Application Support/ai.backend.go/mgmt-api.json`
- Linux: `$XDG_RUNTIME_DIR/ai.backend.go/mgmt-api.json` (fallback: `~/.config/ai.backend.go/mgmt-api.json`)
- Windows: `%APPDATA%\ai.backend.go\mgmt-api.json`
Before connecting, the CLI validates the discovery file by checking that the server process (identified by PID) is still running and that the endpoint responds to a health check. Stale files from crashed instances are silently ignored.
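The stale-file check can be sketched as follows (a minimal sketch, assuming the discovery file stores `endpoint` and `pid` keys; the real field names may differ, and the HTTP health check that follows the PID probe is omitted here):

```python
import json
import os
import pathlib

def load_discovery(path):
    """Return the endpoint from a discovery file, or None if the file is stale.

    Assumes the file shape {"endpoint": "...", "pid": 1234}; the actual CLI
    additionally probes the endpoint with an HTTP health check before use.
    """
    try:
        data = json.loads(pathlib.Path(path).read_text())
    except (OSError, ValueError):
        return None  # missing or malformed file: fall back to later resolution steps
    pid = data.get("pid")
    if not isinstance(pid, int):
        return None
    try:
        os.kill(pid, 0)  # signal 0 probes liveness without delivering anything (POSIX)
    except ProcessLookupError:
        return None      # server crashed: silently ignore the stale file
    except PermissionError:
        pass             # process exists but is owned by another user
    return data.get("endpoint")
```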
Global Options¶
| Option | Short | Environment Variable | Description |
|---|---|---|---|
| --endpoint | -e | BACKEND_AI_GO_ENDPOINT | Management API endpoint (URL or configured name). Overrides auto-discovery. |
| --token | -t | BACKEND_AI_GO_TOKEN | API authentication token. |
| --output | -o | BACKEND_AI_GO_OUTPUT | Output format: console, json, yaml. |
| --quiet | -q | | Suppress non-essential output. |
| --verbose | -v | | Enable verbose output. |
| --no-verify-ssl | | | Skip SSL certificate verification. |
Commands¶
chat - One-Shot Chat Completion¶
Send a single message to a loaded model and print the response.
If MESSAGE is omitted, input is read from stdin (up to 1 MiB).
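How such a cap might be enforced (an illustrative sketch; `read_message` is a hypothetical helper, not part of the aigo source):

```python
import sys

MAX_STDIN = 1 << 20  # 1 MiB cap on piped input, per the rule above

def read_message(argv_message=None, stream=None):
    """Return MESSAGE from the command line if given, else read stdin.

    Reads one byte past the cap so oversized input can be detected
    without buffering arbitrarily large data.
    """
    if argv_message is not None:
        return argv_message
    stream = stream if stream is not None else sys.stdin.buffer
    data = stream.read(MAX_STDIN + 1)
    if len(data) > MAX_STDIN:
        raise SystemExit("error: stdin input exceeds 1 MiB")
    return data.decode("utf-8", errors="replace")
```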
Options:
| Option | Short | Description |
|---|---|---|
| --model <MODEL> | -m | Model to use for completion. |
| --max-tokens <INT> | | Maximum tokens to generate (default: 1024). |
| --temperature <FLOAT> | | Sampling temperature 0.0–2.0 (default: 0.7). Ignored when --reasoning-effort is set. |
| --system <PROMPT> | -s | System prompt to prepend. |
| --reasoning-effort <LEVEL> | | Reasoning effort level for hybrid-thinking models. Accepted values: none, low, medium, high, xhigh. Use none to disable thinking mode via chat_template_kwargs. |
| --no-think | | Disable thinking mode (sets chat_template_kwargs.enable_thinking=false). Takes precedence over --reasoning-effort. |
| --thinking-budget <N> | | Per-request cap on tokens emitted inside the <think> block (sent as thinking_budget_tokens in the request body). -1 = unlimited (engine default), 0 = immediate end (disables thinking), N>0 = hard cap of N tokens. Engine-agnostic: works on both llama-server and mlxcel-server. |
| --preserve-thinking | | Retain <think> blocks from all prior assistant turns instead of stripping them (Qwen3.6+ feature). Sets chat_template_kwargs.preserve_thinking=true. Orthogonal to --no-think / --reasoning-effort; both kwargs coexist when flags are combined. Older Qwen3/3.5 models accept the flag but behavior is unvalidated. |
When --reasoning-effort is set to a level other than none, the request sends both reasoning_effort and chat_template_kwargs: {"enable_thinking": true}. When set to none, or when --no-think is passed, only chat_template_kwargs: {"enable_thinking": false} is sent, which is the correct way to suppress the <think> block on Qwen3/3.5 hybrid-thinking models.
--thinking-budget and --preserve-thinking are independent of --reasoning-effort: the budget caps how many tokens the model can emit inside <think>, and preserve_thinking controls whether prior <think> blocks survive in the prompt. Both fields travel in the per-request HTTP body, so they are forwarded unchanged to llama-server and mlxcel-server (and through the continuum-router passthrough path).
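Putting the flag-to-field mapping together, a client could assemble the request body like this (an illustrative sketch; `build_chat_body` is a hypothetical helper, not part of aigo, but the field names follow the description above):

```python
def build_chat_body(message, model, reasoning_effort=None, no_think=False,
                    thinking_budget=None, preserve_thinking=False):
    """Map the documented chat flags onto a request body (illustrative)."""
    body = {"model": model,
            "messages": [{"role": "user", "content": message}]}
    kwargs = {}
    if no_think or reasoning_effort == "none":
        # --no-think (or --reasoning-effort none) suppresses the <think>
        # block; reasoning_effort itself is then omitted from the body.
        kwargs["enable_thinking"] = False
    elif reasoning_effort is not None:
        body["reasoning_effort"] = reasoning_effort
        kwargs["enable_thinking"] = True
    if preserve_thinking:
        kwargs["preserve_thinking"] = True  # Qwen3.6+: keep prior <think> turns
    if thinking_budget is not None:
        # -1 = unlimited, 0 = disable thinking, N>0 = hard cap
        body["thinking_budget_tokens"] = thinking_budget
    if kwargs:
        body["chat_template_kwargs"] = kwargs
    return body
```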
Examples:

```shell
# Basic chat
aigo chat "What is the capital of France?"

# Disable thinking mode on a Qwen3 model
aigo chat --no-think "Summarize this document" < report.txt

# Enable thinking with medium effort
aigo chat --reasoning-effort medium "Solve this step by step: ..."

# Cap thinking at 64 tokens (force concise reasoning)
aigo chat --thinking-budget 64 --reasoning-effort high "Quick: 2+2=?"

# Disable thinking via the budget (equivalent to --no-think for engines that implement it)
aigo chat --thinking-budget 0 "Just answer directly."

# Preserve <think> blocks across turns on Qwen3.6+ (improves agent KV cache reuse)
aigo chat --preserve-thinking --reasoning-effort high "Continue solving from where we left off."

# Pipe input with a system prompt
echo "SELECT * FROM users" | aigo chat --system "You are a SQL expert."
```
complete - One-Shot Text Completion¶
Send a prompt for text completion (non-chat format).
If PROMPT is omitted, input is read from stdin.
Options:
| Option | Short | Description |
|---|---|---|
| --model <MODEL> | -m | Model to use. |
| --max-tokens <INT> | | Maximum tokens to generate (default: 256). |
| --temperature <FLOAT> | | Sampling temperature 0.0–2.0 (default: 0.7). |
config - Configuration Management¶
Manage CLI configuration settings.
- `aigo config path`: Show the configuration file path.
- `aigo config get <KEY>`: Get a configuration value.
- `aigo config set <KEY> <VALUE>`: Set a configuration value.
- `aigo config list`: List all configuration values.
- `aigo config reset`: Reset configuration to defaults.
model - Local Model Management¶
Manage models stored on the local disk.
- `aigo model list`: List all local models.
- `aigo model info <MODEL_ID>`: Get detailed information about a specific model.
- `aigo model refresh`: Refresh the model index (scan for new files).
loaded - Loaded Model Operations¶
Control models currently loaded into memory for inference.
- `aigo loaded list`: List currently loaded models.
- `aigo loaded info <ID>`: Get details of a loaded model instance.
- `aigo loaded load [OPTIONS] <MODEL_ID>`: Load a model into memory. Options:
  - `-c, --context-length <INT>`: Override context length.
  - `-g, --gpu-layers <INT>`: Number of layers to offload to GPU (-1 for all).
  - `-t, --threads <INT>`: Number of threads to use.
  - `-a, --alias <STRING>`: Model alias for routing.
  - `--tool-calling`: Enable tool calling capabilities.
  - `--mmproj <PATH>`: Path to mmproj file for vision models.
- `aigo loaded unload <ID>`: Unload a model to free resources.
- `aigo loaded health <ID>`: Check the health status of a loaded model.
router - Router Control¶
Manage the Continuum Router service.
- `aigo router status`: Get the current status of the router.
- `aigo router start`: Start the router service.
- `aigo router stop`: Stop the router service.
- `aigo router restart`: Restart the router service.
system - System Monitoring¶
Monitor hardware resources and API status.
- `aigo system info`: Get general system information (OS, architecture).
- `aigo system metrics`: Get current system metrics (CPU, RAM usage).
- `aigo system gpu`: Get detailed GPU information.
- `aigo system health`: Check the overall API health.
- `aigo system version`: Get the API server version.
Examples¶
List all available models in JSON format:
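```shell
# Both the command and the --output flag are documented above
aigo model list --output json
```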
Load a model with custom GPU layers:
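```shell
# <MODEL_ID> is a placeholder for an ID from `aigo model list`; 32 is an example value
aigo loaded load --gpu-layers 32 <MODEL_ID>
```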
Check system GPU status:
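```shell
aigo system gpu
```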