8.4. Distributed Inference Routing¶
When load balancing is enabled, Backend.AI GO transparently routes inference requests to remote nodes. If the requested model is not loaded locally, the inference proxy looks it up in the distributed model index and forwards the request to a remote node that has the model available.
How It Works¶
The inference proxy uses a local-first strategy:
1. Local resolution: Check whether the requested model is loaded locally, either through the Continuum Router port or directly on a model server port.
2. Remote fallback: If the model is not found locally and load balancing is enabled, the `InferenceRouter` resolves the model to a reachable remote node.
3. Authenticated forwarding: The proxy forwards the request to the remote node's API, adding an `Authorization: Bearer` header with the node's API key.
This process is transparent to the caller. The request and response format is unchanged — the routing layer is fully invisible to clients using the OpenAI-compatible API.
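The local-first flow above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; the function name `resolve_endpoint` and the shapes of `local_models` and `remote_index` are assumptions, and strategy-based node selection is elided.

```python
from typing import Optional

LOCAL = "local"  # sentinel meaning "serve on this node"

def resolve_endpoint(model_id: str,
                     local_models: set[str],
                     remote_index: dict[str, list[str]],
                     load_balancing: bool) -> Optional[str]:
    """Local-first resolution (hypothetical data shapes)."""
    # 1. Local resolution: serve locally if the model is loaded here.
    if model_id in local_models:
        return LOCAL
    # 2. Remote fallback: only consulted when load balancing is enabled.
    if load_balancing:
        candidates = remote_index.get(model_id, [])
        if candidates:
            return candidates[0]  # strategy-based selection elided
    # 3. No node has the model: the caller returns 503.
    return None
```

A request for a model that only exists in the remote index thus routes to a remote node fingerprint, while an unknown model resolves to `None`.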
Enabling Distributed Routing¶
Distributed routing is controlled by the load balancing configuration. To enable it:
- Open Settings > Nodes.
- Enable Load Balancing.
- Choose a routing strategy (see below).
When load balancing is disabled, all inference requests are handled locally. When it is enabled and the local node does not have the requested model, requests are automatically routed to a remote node.
Routing Strategies¶
The InferenceRouter supports four strategies:
| Strategy | Behavior |
|---|---|
| `priority` | Prefer the highest-priority node (configurable per node). |
| `round-robin` | Cycle through available nodes in order. |
| `least-loaded` | Route to the node with the fewest active in-flight requests (real-time load). |
| `fastest` | Route to the node with the lowest measured latency. |
Configure the strategy in Settings > Nodes > Load Balancing.
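The four strategies might be implemented roughly like this. The `Node` fields mirror the per-node data exposed by the status endpoints (`priority`, `activeRequests`, `latencyMs`), but the selection code itself is a sketch, including the assumption that a numerically higher `priority` wins.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Node:
    fingerprint: str
    priority: int         # used by "priority"; higher wins (assumed ordering)
    active_requests: int  # in-flight count, used by "least-loaded"
    latency_ms: float     # measured latency, used by "fastest"

_rr = count()  # shared counter driving round-robin rotation

def select(strategy: str, nodes: list[Node]) -> Node:
    """Pick one candidate node according to the configured strategy."""
    if strategy == "priority":
        return max(nodes, key=lambda n: n.priority)
    if strategy == "round-robin":
        return nodes[next(_rr) % len(nodes)]
    if strategy == "least-loaded":
        return min(nodes, key=lambda n: n.active_requests)
    if strategy == "fastest":
        return min(nodes, key=lambda n: n.latency_ms)
    raise ValueError(f"unknown strategy: {strategy}")
```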
Circuit Breaker¶
Each remote node has an associated circuit breaker that protects the pool from cascading failures.
- Threshold: After 3 consecutive failures to a node, the circuit opens and that node is excluded from candidate selection.
- Reset: After 60 seconds, the circuit transitions to half-open, allowing a single probe request through. A successful probe closes the circuit; another failure reopens it immediately.
- Effect on routing: Open-circuit nodes are skipped during endpoint resolution. If all candidates have open circuits, the request returns 503.
The circuit breaker is independent of the failover mechanism — a node with an open circuit will not receive requests even as a failover target.
Automatic Failover¶
When failoverEnabled is true in the load balancing configuration (the default), the proxy will retry failed requests on an alternate node.
Failover triggers when:
- A connection error occurs reaching the selected node, or
- The node responds with a 5xx status code.
On trigger, the proxy re-resolves the endpoint excluding the failed node and retries the request once. The original request body is replayed in full.
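The retry-once behavior might look like the sketch below. The `resolve(model_id, exclude)` and `send(node, body)` callbacks are hypothetical stand-ins for the real proxy internals; the point is the single retry on an alternate node with the full body replayed.

```python
def forward_with_failover(model_id, request_body, resolve, send,
                          failover_enabled=True):
    """Forward to the resolved node; on a connection error or 5xx,
    re-resolve excluding the failed node and retry exactly once."""
    node = resolve(model_id, exclude=set())
    try:
        status, resp = send(node, request_body)
        if status < 500:
            return status, resp  # success or client error: no failover
    except ConnectionError:
        status, resp = 502, None
    if not failover_enabled:
        return status, resp
    # Retry once on an alternate node, replaying the original body.
    alt = resolve(model_id, exclude={node})
    if alt is None:
        return status, resp  # no alternate available
    return send(alt, request_body)
```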
Streaming Failover Limitation¶
Failover is only possible before the first SSE chunk is sent to the client. Once streaming has begun, the response is committed and the error propagates directly to the client. Non-streaming requests support full retry on an alternate node.
Merged Model List¶
The GET /api/v1/inference/models (OpenAI-compatible) endpoint returns a merged list:
- Local models — reported directly by the local inference server.
- Remote models — discovered via the distributed model index, marked with `"owned_by": "remote"`.
Duplicate model IDs are deduplicated; local models always take precedence.
```json
{
  "object": "list",
  "data": [
    { "id": "llama-3-8b-instruct", "object": "model", "owned_by": "llama.cpp" },
    { "id": "mistral-7b-instruct", "object": "model", "owned_by": "remote" }
  ]
}
```
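The dedup-with-local-precedence rule is straightforward to express. A sketch, assuming entries are plain dicts keyed by `id` as in the response above:

```python
def merge_model_lists(local: list[dict], remote: list[dict]) -> dict:
    """Merge local and remote model entries into one OpenAI-style list.
    Duplicate IDs are deduplicated; the local entry always wins."""
    merged = {m["id"]: m for m in remote}       # remote entries first...
    merged.update({m["id"]: m for m in local})  # ...local overwrites duplicates
    return {"object": "list", "data": list(merged.values())}
```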
Distributed Pool Status¶
Two API endpoints expose the state of the distributed pool.
Inference Status (summary)¶
GET /api/v1/inference/status returns a distributed field when load balancing is enabled:
```json
{
  "available": true,
  "routerRunning": true,
  "routerPort": 8080,
  "loadedModels": 1,
  "llmModels": 1,
  "diffusionModels": 0,
  "distributed": {
    "enabled": true,
    "strategy": "priority",
    "totalNodes": 2,
    "onlineNodes": 2,
    "remoteModels": 3
  }
}
```
Full Distributed Status¶
GET /api/v1/pool/distributed-status returns the complete pool view including per-node details:
```json
{
  "enabled": true,
  "strategy": "priority",
  "totalNodes": 2,
  "onlineNodes": 2,
  "models": [
    {
      "id": "mistral-7b-instruct",
      "availableOn": ["node-fingerprint-abc"]
    }
  ],
  "endpoints": [
    {
      "fingerprint": "node-fingerprint-abc",
      "baseUrl": "http://192.168.1.50:8080",
      "status": "online",
      "latencyMs": 12,
      "loadedModelCount": 2,
      "priority": 5
    }
  ]
}
```
This endpoint is also accessible from the frontend via the get_distributed_pool_status Tauri IPC command.
Routing Statistics¶
GET /api/v1/pool/routing-stats returns per-node request metrics and circuit breaker state:
```json
{
  "nodes": [
    {
      "fingerprint": "node-fingerprint-abc",
      "totalRequests": 150,
      "successCount": 147,
      "failureCount": 3,
      "avgLatencyMs": 42.5,
      "activeRequests": 2,
      "circuitOpen": false,
      "consecutiveFailures": 0
    }
  ],
  "totalRequests": 150,
  "totalSuccesses": 147,
  "totalFailures": 3,
  "circuitBreaker": {
    "failureThreshold": 3,
    "resetTimeoutSecs": 60
  }
}
```
The same data is also available via the get_routing_stats Tauri IPC command and the nodeStore.getRoutingStats() action in the frontend store.
Remote Node Authentication¶
When a request is routed to a remote node, the proxy adds an Authorization: Bearer <api_key> header. The API key is stored securely in the local node registry — it is never exposed in logs or the UI.
Ensure that the remote node has an API key configured. Requests to nodes without an API key will fail if the remote node requires authentication.
Fallback Behavior¶
| Condition | Behavior |
|---|---|
| Load balancing disabled | Serve locally only; return 503 if the model is not loaded. |
| Load balancing enabled, model available locally | Serve locally. |
| Load balancing enabled, model on remote node | Forward to remote node. |
| Load balancing enabled, model not found anywhere | Return 503 Service Unavailable. |
| Remote node unreachable, failover enabled | Retry once on an alternate node. |
| Remote node unreachable, failover disabled | Return the upstream error to the client. |
| Node circuit open | Skip node during routing; use alternate if available. |
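The first four rows of the table reduce to a small decision function. A sketch mirroring the documented behavior (failover and circuit-breaker rows happen later, during forwarding, and are omitted here):

```python
def route_decision(lb_enabled: bool,
                   local_has_model: bool,
                   remote_has_model: bool):
    """Where does the request go? Returns "local", "remote",
    or 503 when no node can serve the model."""
    if local_has_model:
        return "local"                 # local always wins
    if lb_enabled and remote_has_model:
        return "remote"                # forward to a remote node
    return 503                         # not loaded anywhere reachable
```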
Related Pages¶
- Multi-Node Overview — Introduction to connecting nodes
- Manual Registration — Add remote nodes
- Auto-Discovery — Discover nodes on the local network
- Team AI with Multi-Node — Team setup guide