8.1. Supervisor Agent

The Supervisor Agent is a centralized decision-making system that continuously monitors the health and resource usage of all running models, and takes coordinated actions to ensure reliability and efficiency. It automatically restarts failed models, evicts idle ones to free resources, and even predicts potential issues before they occur.

Concept

Backend.AI GO can run multiple models simultaneously, each consuming GPU memory, system RAM, and CPU. Without supervision, a crashed model might go unnoticed, or idle models might waste valuable resources indefinitely.

The Supervisor Agent solves this by introducing a three-layer architecture:

┌──────────────────────────────────────────────────────┐
│  Intelligence Layer                                  │
│  Predictive Analytics · Adaptive Tuning · Webhooks   │
└───────────────────────────┬──────────────────────────┘
┌───────────────────────────▼──────────────────────────┐
│  Decision Layer                                      │
│  Supervisor Agent · Policy Engine · Resource Arbiter │
└───────────────────────────┬──────────────────────────┘
┌───────────────────────────▼──────────────────────────┐
│  Foundation Layer                                    │
│  Event Bus · Heartbeat Monitor · Hook Registry       │
└──────────────────────────────────────────────────────┘
  • Foundation Layer: Detects problems — heartbeat checks, idle tracking, and lifecycle events.
  • Decision Layer: Decides what to do — evaluates policies, resolves conflicts, and executes actions.
  • Intelligence Layer: Anticipates issues — usage forecasting, adaptive parameter tuning, and external notifications.

Getting Started

Enabling the Supervisor

  1. Go to Settings > Supervisor.
  2. Toggle the Supervisor Agent switch to enable it.
  3. Select a Configuration Preset (see below).

Once enabled, the Supervisor begins its multi-tier decision loop, checking model health and system resources at regular intervals.
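
The multi-tier decision loop can be pictured as three timers sharing one clock: fast ticks drive heartbeat checks, medium ticks drive policy evaluation, and slow ticks drive forecasting and optimization. A minimal sketch in Python, using the Balanced preset's intervals (the tier-to-task mapping here is an illustrative assumption, not Backend.AI GO's actual internals):

```python
# Illustrative multi-tier tick scheduler. Each call to tick() advances a
# shared clock and reports which tiers are due to run at that moment.
class TieredScheduler:
    def __init__(self, fast: int = 3, medium: int = 15, slow: int = 60):
        self.fast, self.medium, self.slow = fast, medium, slow  # seconds
        self.elapsed = 0

    def tick(self, seconds: int) -> list[str]:
        """Advance the clock and return the tiers that fire now."""
        self.elapsed += seconds
        due = []
        if self.elapsed % self.fast == 0:
            due.append("fast")    # heartbeat / health checks
        if self.elapsed % self.medium == 0:
            due.append("medium")  # policy evaluation
        if self.elapsed % self.slow == 0:
            due.append("slow")    # forecasting / resource optimization
        return due

sched = TieredScheduler()
print(sched.tick(3))   # 3 s elapsed  -> ['fast']
print(sched.tick(12))  # 15 s elapsed -> ['fast', 'medium']
```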

Configuration Presets

Three built-in presets let you quickly tune the Supervisor's behavior:

Setting                    Conservative   Balanced   Aggressive
Fast tick interval         5s             3s         2s
Medium tick interval       30s            15s        10s
Slow tick interval         120s           60s        30s
Memory pressure threshold  95%            90%        80%
GPU pressure threshold     95%            90%        80%
Idle timeout               60 min         30 min     10 min
Max restart attempts       1              3          5
Idle eviction              Off            On         On
Resource optimization      Off            On         On
  • Conservative: Minimal intervention. Only restarts failed models and protects pinned models. Best for stable, long-running setups.
  • Balanced: Moderate resource management with idle eviction and optimization enabled. Recommended for most users.
  • Aggressive: Proactive resource reclamation. Short idle timeouts, lower pressure thresholds, and more restart attempts. Best for resource-constrained machines.
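
For reference, the preset values from the table above can be expressed as plain data. The key names below are illustrative, not necessarily the configuration keys the API uses:

```python
# The three built-in presets from the table above, as illustrative data.
PRESETS = {
    "conservative": {"fast_tick_s": 5, "medium_tick_s": 30, "slow_tick_s": 120,
                     "memory_pressure": 0.95, "gpu_pressure": 0.95,
                     "idle_timeout_min": 60, "max_restart_attempts": 1,
                     "idle_eviction": False, "resource_optimization": False},
    "balanced":     {"fast_tick_s": 3, "medium_tick_s": 15, "slow_tick_s": 60,
                     "memory_pressure": 0.90, "gpu_pressure": 0.90,
                     "idle_timeout_min": 30, "max_restart_attempts": 3,
                     "idle_eviction": True, "resource_optimization": True},
    "aggressive":   {"fast_tick_s": 2, "medium_tick_s": 10, "slow_tick_s": 30,
                     "memory_pressure": 0.80, "gpu_pressure": 0.80,
                     "idle_timeout_min": 10, "max_restart_attempts": 5,
                     "idle_eviction": True, "resource_optimization": True},
}

# Aggressive reclaims resources sooner: lower thresholds, shorter timeouts.
print(PRESETS["aggressive"]["idle_timeout_min"])  # 10
```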

Choosing a Preset

Start with Balanced for everyday use. Switch to Conservative if you prefer manual control, or Aggressive on machines with limited GPU memory.

Health Monitoring

The Supervisor runs a per-model heartbeat monitor that periodically checks whether each loaded model's inference server is responding.

Health States

Each model transitions through four health states:

Unknown → Healthy → Degraded → Dead
                  ↑           ↓
                  └───────────┘
                   (recovery)
  • Unknown: No health check has completed yet (the initial state after loading).
  • Healthy: The model responds to health checks normally.
  • Degraded: Several consecutive health checks have failed, but the model hasn't reached the dead threshold yet.
  • Dead: The model has stopped responding entirely.

When a model transitions to Dead, the Supervisor can automatically restart it (up to the configured maximum attempts).
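
The transition rules can be sketched as a small state machine. The thresholds below (2 consecutive failures → Degraded, 5 → Dead) are illustrative assumptions; the actual thresholds are part of the Supervisor's configuration:

```python
# Illustrative health-state tracker: consecutive failed checks move a model
# from Healthy to Degraded to Dead; any successful check recovers it.
class HealthTracker:
    DEGRADED_AFTER = 2   # assumed threshold
    DEAD_AFTER = 5       # assumed threshold

    def __init__(self):
        self.state = "unknown"
        self.failures = 0

    def record(self, check_ok: bool) -> str:
        if check_ok:
            self.failures = 0
            self.state = "healthy"          # recovery from any state
        else:
            self.failures += 1
            if self.failures >= self.DEAD_AFTER:
                self.state = "dead"
            elif self.failures >= self.DEGRADED_AFTER:
                self.state = "degraded"
        return self.state

t = HealthTracker()
for ok in (True, False, False, True):
    print(t.record(ok))   # healthy, healthy, degraded, healthy
```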

Pinning Models

You can pin a model to prevent it from being evicted by idle tracking or resource optimization policies. Pinned models are always kept running and will be auto-restarted on failure regardless of other policies.

To pin a model, right-click it in the model list and select Pin Model, or use the API:

curl -X POST http://localhost:8090/api/v1/lifecycle/pin \
  -H "Content-Type: application/json" \
  -d '{"model_id": "my-important-model"}'

Idle Tracking

The Supervisor tracks when each model was last used (via inference requests). Models that exceed the configured idle timeout are candidates for eviction, freeing up GPU memory and RAM for other models.

  • Automatic touch: Every inference request automatically updates the model's last-activity timestamp — no manual configuration needed.
  • Pinned models are exempt: Pinned models are never evicted due to inactivity.
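
The eviction rule described above boils down to a simple predicate: a model is a candidate when it is unpinned and idle longer than the timeout. A sketch (the bookkeeping fields here are hypothetical):

```python
# Illustrative idle-eviction check over per-model activity bookkeeping.
import time

def eviction_candidates(models: dict, idle_timeout_s: float, now: float) -> list[str]:
    """models maps model_id -> {"last_used": epoch_seconds, "pinned": bool}."""
    return [mid for mid, m in models.items()
            if not m["pinned"] and now - m["last_used"] > idle_timeout_s]

now = time.time()
models = {
    "chat-7b":  {"last_used": now - 3600, "pinned": False},  # idle for 1 h
    "embed":    {"last_used": now - 3600, "pinned": True},   # pinned -> exempt
    "code-13b": {"last_used": now - 60,   "pinned": False},  # recently used
}
print(eviction_candidates(models, idle_timeout_s=1800, now=now))  # ['chat-7b']
```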

Policy Engine

The Supervisor uses a priority-based Policy Engine to decide what actions to take. Five built-in policies are evaluated on every decision cycle:

Priority     Policy                 Purpose
0 (highest)  Safety                 Prevent out-of-memory crashes and thermal shutdown
1            Availability           Keep models running by restarting failed instances
2            Pinned Model           Ensure pinned models are always available
3            Idle Eviction          Reclaim resources from inactive models
4 (lowest)   Resource Optimization  Proactively optimize resource allocation

When two policies propose conflicting actions (e.g., one wants to keep a model loaded while another wants to evict it), the higher-priority policy always wins. Every conflict is recorded in the audit log for transparency.
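
Conflict resolution amounts to sorting proposals by priority and letting the first proposal per model win, while recording the losers for the audit log. An illustrative sketch (the proposal shape is an assumption):

```python
# Illustrative priority-based conflict resolution: lower priority number
# means higher priority, so it wins any conflict on the same model.
def resolve(proposals: list[dict]) -> dict:
    """proposals: [{"policy", "priority", "model", "action"}, ...]"""
    winners, conflicts = {}, []
    for p in sorted(proposals, key=lambda p: p["priority"]):
        if p["model"] in winners:
            conflicts.append({"lost": p["policy"],
                              "won": winners[p["model"]]["policy"]})
        else:
            winners[p["model"]] = p
    return {"winners": winners, "conflicts": conflicts}

result = resolve([
    {"policy": "idle_eviction", "priority": 3, "model": "chat-7b", "action": "evict"},
    {"policy": "pinned_model",  "priority": 2, "model": "chat-7b", "action": "keep"},
])
print(result["winners"]["chat-7b"]["action"])  # keep
```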

Audit Log

Every decision the Supervisor makes is recorded in a detailed Audit Log. Each entry includes:

  • Timestamp and decision ID
  • System snapshot at the time of the decision (loaded models, health status, resource usage)
  • Policies evaluated and their proposed actions
  • Conflicts resolved (which policy won and why)
  • Actions taken and their outcomes (success, failed, or skipped)

Viewing the Audit Log

Go to Settings > Supervisor and scroll to the Recent Decisions section. Click any entry to expand its details, including the full snapshot summary, action list, and conflict records.

You can also query the audit log via the Management API:

curl "http://localhost:8090/api/v1/supervisor/audit?limit=20"

Fallback Routing

The Supervisor integrates with the Continuum Router to provide automatic failover. When a primary model becomes unresponsive:

  1. The Supervisor detects the failure via heartbeat monitoring.
  2. It activates a fallback route, redirecting inference requests to a designated backup model.
  3. When the primary model recovers, the fallback route is deactivated and traffic returns to normal.

This ensures that your API clients experience minimal disruption, even during model failures.
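
The failover sequence can be sketched as a tiny router that flips to the backup while the primary is unhealthy. The model names are made up, and this is not the Continuum Router's actual implementation:

```python
# Illustrative fallback router: route to the primary unless a fallback
# route is active, and toggle that route on primary health changes.
class FallbackRouter:
    def __init__(self, primary: str, backup: str):
        self.primary, self.backup = primary, backup
        self.fallback_active = False

    def on_health_change(self, model: str, healthy: bool):
        if model == self.primary:
            self.fallback_active = not healthy   # activate / deactivate route

    def route(self) -> str:
        return self.backup if self.fallback_active else self.primary

r = FallbackRouter(primary="llama-70b", backup="llama-8b")
r.on_health_change("llama-70b", healthy=False)
print(r.route())  # llama-8b (fallback active)
r.on_health_change("llama-70b", healthy=True)
print(r.route())  # llama-70b (traffic returns to normal)
```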

Predictive Analytics

Advanced Feature

Predictive analytics is disabled by default. Enable it in Settings > Supervisor under the predictive configuration section.

The Supervisor can analyze historical usage patterns to make proactive decisions:

  • Usage Forecasting: Predicts how many requests each model will receive in the next hour using exponentially weighted moving average (EWMA) analysis.
  • Failure Prediction: Detects early warning signs of instability, memory leaks, thermal throttling, or latency degradation.
  • Demand-Based Preloading: Automatically loads models that are predicted to be needed soon based on usage trends.
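
As a rough illustration of the EWMA technique behind usage forecasting (the smoothing factor alpha = 0.3 and the sample traffic are arbitrary choices, not Backend.AI GO's values):

```python
# Illustrative EWMA forecast: each observed hourly request count nudges the
# running average toward itself; the final value predicts the next hour.
def ewma_forecast(hourly_counts: list[float], alpha: float = 0.3) -> float:
    forecast = hourly_counts[0]
    for count in hourly_counts[1:]:
        forecast = alpha * count + (1 - alpha) * forecast
    return forecast

# Rising traffic pulls the prediction upward while smoothing the spike.
print(round(ewma_forecast([100, 110, 160, 200]), 1))  # 144.1
```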

Viewing Forecasts

In the Supervisor settings page, the Predictive Analytics section displays:

  • A table of usage forecasts per model (predicted requests/hour, trend direction, confidence level)
  • Risk indicators showing failure predictions with recommended actions

Adaptive Tuning

Advanced Feature

Adaptive tuning is disabled by default. Enable it in Settings > Supervisor under the adaptive configuration section.

The Supervisor can learn from its own decisions by tracking outcomes:

  • Outcome Feedback: After executing an action (e.g., restarting a model), the system checks whether the action achieved its goal.
  • Parameter Adjustment: Based on success rates, the Supervisor adjusts its internal parameters (within safe bounds) to improve future decisions.
  • Rate-Limited Changes: Adjustments are capped at a configurable maximum percentage per cycle to prevent oscillation.
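
The rate-limited adjustment step can be sketched as: move toward the target, cap the move at a percentage of the current value, then clamp to safe bounds. The parameter, the 10% cap, and the bounds are illustrative:

```python
# Illustrative rate-limited parameter adjustment with safe bounds.
def adjust(current: float, target: float, lo: float, hi: float,
           max_change_pct: float = 0.10) -> float:
    max_step = abs(current) * max_change_pct        # per-cycle change cap
    step = max(-max_step, min(max_step, target - current))
    return max(lo, min(hi, current + step))         # clamp to safe bounds

# A 30 min idle timeout trending toward 20 min moves at most 10% per cycle,
# preventing oscillation when outcomes swing the target around.
print(adjust(current=30.0, target=20.0, lo=5.0, hi=120.0))  # 27.0
```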

View the tuning history in Settings > Supervisor to see what parameters were adjusted, their old and new values, and the reasons for each change.

External Integrations

Webhooks

Register webhook endpoints to receive real-time notifications about Supervisor events:

  • Health state changes
  • Auto-restarts
  • Model evictions and preloads
  • Fallback activations
  • Resource alerts
  • Failure predictions

Each webhook delivery includes an HMAC-SHA256 signature (if a signing secret is configured) for payload verification.
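
On the receiving end, a webhook handler can verify the signature by recomputing the HMAC over the raw request body with the shared signing secret. The sample payload below and the assumption of hex encoding are illustrative; check your deliveries for the actual header name and encoding:

```python
# Illustrative HMAC-SHA256 webhook verification with constant-time compare.
import hashlib
import hmac

def verify_signature(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

secret = b"my-signing-secret"
payload = b'{"event":"model.restarted","model_id":"chat-7b"}'  # hypothetical
sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(secret, payload, sig))       # True
print(verify_signature(secret, b"tampered", sig))   # False
```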

To add a webhook:

  1. Go to Settings > Supervisor.
  2. Scroll to the Webhooks section.
  3. Enter a name, URL, and select the event types you want to receive.
  4. Click Add Webhook.

You can test a webhook using the Test button to verify connectivity.

Prometheus Metrics

Enable the Prometheus metrics endpoint to scrape Supervisor and model health data:

curl http://localhost:8090/metrics

Available metrics include:

  • bgo_model_health_status — Per-model health (0=dead, 1=degraded, 2=healthy)
  • bgo_model_request_total — Total requests per model
  • bgo_model_inference_latency_seconds — Inference latency percentiles (p50, p95, p99)
  • bgo_supervisor_decisions_total — Decision count by action type
  • bgo_resource_gpu_memory_usage_ratio — GPU memory utilization
  • bgo_resource_system_memory_usage_ratio — System RAM utilization

OpenTelemetry (OTLP)

For environments using OpenTelemetry collectors, enable OTLP export in the Supervisor configuration and specify your collector endpoint.

Lifecycle Events (SSE)

For headless or web clients, the Supervisor streams lifecycle events via Server-Sent Events (SSE):

curl -N http://localhost:8090/api/v1/lifecycle/events

Events include model loads/unloads, health changes, idle timeouts, auto-restarts, and resource pressure alerts. This allows external tools to react to system changes in real time without polling.
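
SSE is a plain-text protocol: each event is a block of `field: value` lines terminated by a blank line. A minimal parser for that wire format (the event name in the sample is hypothetical):

```python
# Illustrative SSE parser: split the stream into events on blank lines,
# collecting "field: value" pairs and skipping ":"-prefixed keep-alives.
def parse_sse(stream: str) -> list[dict]:
    events, current = [], {}
    for line in stream.splitlines():
        if not line:                  # blank line ends the current event
            if current:
                events.append(current)
                current = {}
        elif line.startswith(":"):    # SSE comment / keep-alive
            continue
        elif ":" in line:
            field, _, value = line.partition(":")
            current[field.strip()] = value.strip()
    if current:
        events.append(current)
    return events

raw = 'event: model.health_changed\ndata: {"model_id": "chat-7b"}\n\n'
print(parse_sse(raw))
```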

API Reference

Supervisor Endpoints

Method  Endpoint                               Description
GET     /api/v1/supervisor/status              Current supervisor status and statistics
GET     /api/v1/supervisor/config              Current configuration
PUT     /api/v1/supervisor/config              Update configuration
POST    /api/v1/supervisor/start               Start the supervisor
POST    /api/v1/supervisor/stop                Stop the supervisor
GET     /api/v1/supervisor/audit               Query audit log (supports from, to, modelId, limit params)
GET     /api/v1/supervisor/policies            List active policies
GET     /api/v1/supervisor/fallbacks           List fallback configurations
GET     /api/v1/supervisor/forecast            Usage forecasts
GET     /api/v1/supervisor/predictions         Failure predictions
GET     /api/v1/supervisor/tuning              Adaptive tuning status
GET     /api/v1/supervisor/webhooks            List registered webhooks
POST    /api/v1/supervisor/webhooks            Register a new webhook
DELETE  /api/v1/supervisor/webhooks/{id}       Remove a webhook
POST    /api/v1/supervisor/webhooks/{id}/test  Send a test delivery

Lifecycle Endpoints

Method  Endpoint                             Description
GET     /api/v1/lifecycle/health             Health status of all models
GET     /api/v1/lifecycle/health/{model_id}  Health status of a specific model
GET     /api/v1/lifecycle/config             Lifecycle configuration
PUT     /api/v1/lifecycle/config             Update lifecycle configuration
POST    /api/v1/lifecycle/pin                Pin a model
POST    /api/v1/lifecycle/unpin              Unpin a model
GET     /api/v1/lifecycle/events             SSE stream of lifecycle events
GET     /metrics                             Prometheus metrics endpoint

Fail-Safe Design

The Supervisor is designed with a fail-open philosophy:

  • Foundation hooks remain active: The base heartbeat and lifecycle hooks from the Foundation Layer are never disabled. If the Supervisor Agent itself crashes, these hooks continue to provide basic health monitoring and auto-restart.
  • Agent self-heartbeat: The Supervisor emits its own heartbeat. If the frontend or Management API detects the Supervisor's heartbeat is lost, it falls back to the Foundation Layer's hook-based behavior.
  • Continuum Router fallback: Even during Supervisor downtime, the Continuum Router's fallback model configuration ensures that API requests continue to be served.

This layered approach ensures that no single component failure can take down the entire system.