8.1. Supervisor Agent¶
The Supervisor Agent is a centralized decision-making system that continuously monitors the health and resource usage of all running models, and takes coordinated actions to ensure reliability and efficiency. It automatically restarts failed models, evicts idle ones to free resources, and even predicts potential issues before they occur.
Concept¶
Backend.AI GO can run multiple models simultaneously, each consuming GPU memory, system RAM, and CPU. Without supervision, a crashed model might go unnoticed, or idle models might waste valuable resources indefinitely.
The Supervisor Agent solves this by introducing a three-layer architecture:
┌─────────────────────────────────────────────────────┐
│ Intelligence Layer │
│ Predictive Analytics · Adaptive Tuning · Webhooks │
└──────────────────────┬──────────────────────────────┘
┌──────────────────────▼──────────────────────────────┐
│ Decision Layer │
│ Supervisor Agent · Policy Engine · Resource Arbiter │
└──────────────────────┬──────────────────────────────┘
┌──────────────────────▼──────────────────────────────┐
│ Foundation Layer │
│ Event Bus · Heartbeat Monitor · Hook Registry │
└─────────────────────────────────────────────────────┘
- Foundation Layer: Detects problems — heartbeat checks, idle tracking, and lifecycle events.
- Decision Layer: Decides what to do — evaluates policies, resolves conflicts, and executes actions.
- Intelligence Layer: Anticipates issues — usage forecasting, adaptive parameter tuning, and external notifications.
Getting Started¶
Enabling the Supervisor¶
- Go to Settings > Supervisor.
- Toggle the Supervisor Agent switch to enable it.
- Select a Configuration Preset (see below).
Once enabled, the Supervisor begins its multi-tier decision loop, checking model health and system resources at regular intervals.
Configuration Presets¶
Three built-in presets let you quickly tune the Supervisor's behavior:
| Setting | Conservative | Balanced | Aggressive |
|---|---|---|---|
| Fast tick interval | 5s | 3s | 2s |
| Medium tick interval | 30s | 15s | 10s |
| Slow tick interval | 120s | 60s | 30s |
| Memory pressure threshold | 95% | 90% | 80% |
| GPU pressure threshold | 95% | 90% | 80% |
| Idle timeout | 60 min | 30 min | 10 min |
| Max restart attempts | 1 | 3 | 5 |
| Idle eviction | Off | On | On |
| Resource optimization | Off | On | On |
- Conservative: Minimal intervention. Only restarts failed models and protects pinned models. Best for stable, long-running setups.
- Balanced: Moderate resource management with idle eviction and optimization enabled. Recommended for most users.
- Aggressive: Proactive resource reclamation. Short idle timeouts, lower pressure thresholds, and more restart attempts. Best for resource-constrained machines.
Choosing a Preset
Start with Balanced for everyday use. Switch to Conservative if you prefer manual control, or Aggressive on machines with limited GPU memory.
Health Monitoring¶
The Supervisor runs a per-model heartbeat monitor that periodically checks whether each loaded model's inference server is responding.
Health States¶
Each model transitions through the following health states:
- Healthy: The model responds to health checks normally.
- Degraded: Several consecutive health checks have failed, but the model hasn't reached the dead threshold yet.
- Dead: The model has stopped responding entirely.
When a model transitions to Dead, the Supervisor can automatically restart it (up to the configured maximum attempts).
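The transition logic can be pictured as a small state machine driven by consecutive failed heartbeats. The sketch below is illustrative only; the failure thresholds (`degraded_after`, `dead_after`) are hypothetical parameters, not the Supervisor's actual values.

```python
class HealthMonitor:
    """Illustrative per-model heartbeat state tracker (not the real implementation)."""

    def __init__(self, degraded_after=3, dead_after=6):
        self.degraded_after = degraded_after  # consecutive failures -> Degraded
        self.dead_after = dead_after          # consecutive failures -> Dead
        self.failures = 0

    def record(self, check_ok: bool) -> str:
        """Record one heartbeat result and return the resulting health state."""
        if check_ok:
            self.failures = 0  # any successful check resets the counter
        else:
            self.failures += 1
        if self.failures >= self.dead_after:
            return "dead"
        if self.failures >= self.degraded_after:
            return "degraded"
        return "healthy"


monitor = HealthMonitor()
states = [monitor.record(ok) for ok in [True, False, False, False, False, False, False]]
print(states)  # walks healthy -> degraded -> dead as failures accumulate
```

A model that later answers a health check returns to Healthy, since a successful check clears the failure counter.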
Pinning Models¶
You can pin a model to prevent it from being evicted by idle tracking or resource optimization policies. Pinned models are always kept running and will be auto-restarted on failure regardless of other policies.
To pin a model, right-click it in the model list and select Pin Model, or use the API:
curl -X POST http://localhost:8090/api/v1/lifecycle/pin \
  -H "Content-Type: application/json" \
  -d '{"model_id": "my-important-model"}'
Idle Tracking¶
The Supervisor tracks when each model was last used (via inference requests). Models that exceed the configured idle timeout are candidates for eviction, freeing up GPU memory and RAM for other models.
- Automatic touch: Every inference request automatically updates the model's last-activity timestamp — no manual configuration needed.
- Pinned models are exempt: Pinned models are never evicted due to inactivity.
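Eviction candidacy boils down to comparing each model's last-activity timestamp against the idle timeout, skipping pinned models. A minimal sketch (the field names `last_activity` and `pinned` are hypothetical):

```python
import time


def eviction_candidates(models, idle_timeout_s):
    """Return model IDs idle longer than the timeout, skipping pinned models.

    `models` maps model_id -> {"last_activity": epoch seconds, "pinned": bool}.
    Illustrative only; not the Supervisor's internal data model.
    """
    now = time.time()
    return [
        model_id
        for model_id, info in models.items()
        if not info["pinned"] and now - info["last_activity"] > idle_timeout_s
    ]


models = {
    "chat-model":  {"last_activity": time.time() - 3600, "pinned": False},
    "embed-model": {"last_activity": time.time() - 60,   "pinned": False},
    "vip-model":   {"last_activity": time.time() - 7200, "pinned": True},
}
# With a 30-minute timeout, only "chat-model" qualifies: "embed-model" is
# recently active and "vip-model" is pinned.
print(eviction_candidates(models, idle_timeout_s=1800))
```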
Policy Engine¶
The Supervisor uses a priority-based Policy Engine to decide what actions to take. Five built-in policies are evaluated on every decision cycle:
| Priority | Policy | Purpose |
|---|---|---|
| 0 (highest) | Safety | Prevent out-of-memory crashes and thermal shutdown |
| 1 | Availability | Keep models running by restarting failed instances |
| 2 | Pinned Model | Ensure pinned models are always available |
| 3 | Idle Eviction | Reclaim resources from inactive models |
| 4 (lowest) | Resource Optimization | Proactively optimize resource allocation |
When two policies propose conflicting actions (e.g., one wants to keep a model loaded while another wants to evict it), the higher-priority policy always wins. Every conflict is recorded in the audit log for transparency.
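The resolution rule is simple: for each model, the proposal from the policy with the lowest priority number wins. A sketch of that arbitration (the tuple layout is an assumption for illustration):

```python
def resolve(proposals):
    """Pick one winning action per model: lowest priority number wins.

    `proposals` is a list of (priority, policy, model_id, action) tuples.
    Illustrative sketch of priority-based conflict resolution, not the
    actual Policy Engine code.
    """
    winners = {}
    for priority, policy, model_id, action in sorted(proposals):
        # sorted() puts the highest-priority (lowest number) proposal first,
        # and setdefault keeps only the first proposal seen per model
        winners.setdefault(model_id, (policy, action))
    return winners


proposals = [
    (3, "idle_eviction", "chat-model", "unload"),
    (2, "pinned_model", "chat-model", "keep_loaded"),
]
# The Pinned Model policy (priority 2) overrides Idle Eviction (priority 3)
print(resolve(proposals))
```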
Audit Log¶
Every decision the Supervisor makes is recorded in a detailed Audit Log. Each entry includes:
- Timestamp and decision ID
- System snapshot at the time of the decision (loaded models, health status, resource usage)
- Policies evaluated and their proposed actions
- Conflicts resolved (which policy won and why)
- Actions taken and their outcomes (success, failed, or skipped)
Viewing the Audit Log¶
Go to Settings > Supervisor and scroll to the Recent Decisions section. Click any entry to expand its details, including the full snapshot summary, action list, and conflict records.
You can also query the audit log via the Management API using GET /api/v1/supervisor/audit.
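As a sketch, a query can be assembled from the documented parameters (from, to, modelId, limit); the parameter values below are examples only:

```python
from urllib.parse import urlencode

# Build an audit-log query from the documented parameters
# (from, to, modelId, limit). The values here are examples.
params = urlencode({"modelId": "my-important-model", "limit": 10})
url = f"http://localhost:8090/api/v1/supervisor/audit?{params}"
print(url)

# Against a running instance you would then fetch and decode it, e.g.:
#   import json, urllib.request
#   entries = json.load(urllib.request.urlopen(url))
```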
Fallback Routing¶
The Supervisor integrates with the Continuum Router to provide automatic failover. When a primary model becomes unresponsive:
- The Supervisor detects the failure via heartbeat monitoring.
- It activates a fallback route, redirecting inference requests to a designated backup model.
- When the primary model recovers, the fallback route is deactivated and traffic returns to normal.
This ensures that your API clients experience minimal disruption, even during model failures.
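Conceptually, the routing decision is a health-aware lookup: use the primary if it is healthy, otherwise its designated backup. This sketch is an illustrative model, not the Continuum Router's implementation:

```python
def route(model_id, health, fallbacks):
    """Route a request to the primary model, or its fallback when unhealthy.

    `health` maps model_id -> state ("healthy"/"degraded"/"dead") and
    `fallbacks` maps primary -> backup. Illustrative only.
    """
    if health.get(model_id) == "healthy":
        return model_id
    backup = fallbacks.get(model_id)
    if backup and health.get(backup) == "healthy":
        return backup
    raise RuntimeError(f"no healthy route for {model_id}")


health = {"primary-model": "dead", "backup-model": "healthy"}
fallbacks = {"primary-model": "backup-model"}
# The primary is dead, so requests are redirected to the backup
print(route("primary-model", health, fallbacks))
```

Once the primary reports healthy again, the same lookup naturally routes traffic back to it.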
Predictive Analytics¶
Advanced Feature
Predictive analytics is disabled by default. Enable it in Settings > Supervisor under the predictive configuration section.
The Supervisor can analyze historical usage patterns to make proactive decisions:
- Usage Forecasting: Predicts how many requests each model will receive in the next hour using exponentially weighted moving average (EWMA) analysis.
- Failure Prediction: Detects early warning signs of instability such as memory leaks, thermal throttling, or latency degradation.
- Demand-Based Preloading: Automatically loads models that are predicted to be needed soon based on usage trends.
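The EWMA forecast weights recent hours more heavily than older ones, so a rising request rate pulls the estimate upward. A minimal sketch; the smoothing factor `alpha` here is illustrative, not the Supervisor's value:

```python
def ewma_forecast(hourly_counts, alpha=0.3):
    """Exponentially weighted moving average over past hourly request counts.

    Each new observation is blended into the running estimate with weight
    `alpha`, so recent hours dominate. The final estimate serves as a
    simple next-hour forecast.
    """
    estimate = hourly_counts[0]
    for count in hourly_counts[1:]:
        estimate = alpha * count + (1 - alpha) * estimate
    return estimate


# Recent hours trend sharply upward, so the forecast sits well above
# the older observations
print(round(ewma_forecast([10, 12, 15, 30, 42]), 1))
```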
Viewing Forecasts¶
In the Supervisor settings page, the Predictive Analytics section displays:
- A table of usage forecasts per model (predicted requests/hour, trend direction, confidence level)
- Risk indicators showing failure predictions with recommended actions
Adaptive Tuning¶
Advanced Feature
Adaptive tuning is disabled by default. Enable it in Settings > Supervisor under the adaptive configuration section.
The Supervisor can learn from its own decisions by tracking outcomes:
- Outcome Feedback: After executing an action (e.g., restarting a model), the system checks whether the action achieved its goal.
- Parameter Adjustment: Based on success rates, the Supervisor adjusts its internal parameters (within safe bounds) to improve future decisions.
- Rate-Limited Changes: Adjustments are capped at a configurable maximum percentage per cycle to prevent oscillation.
View the tuning history in Settings > Supervisor to see what parameters were adjusted, their old and new values, and the reasons for each change.
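The rate-limiting rule can be sketched as a simple clamp on each cycle's parameter change. The 10% cap below is a hypothetical example, not the Supervisor's configured value:

```python
def adjust(current, proposed, max_change_pct=10.0):
    """Clamp a proposed parameter change to at most max_change_pct per cycle.

    Illustrative sketch of rate-limited adaptive tuning: large proposed
    jumps are applied gradually across cycles to prevent oscillation.
    """
    cap = abs(current) * max_change_pct / 100.0
    delta = proposed - current
    if delta > cap:
        return current + cap
    if delta < -cap:
        return current - cap
    return proposed


# A proposed jump from a 30-minute idle timeout down to 10 minutes is
# applied gradually: at most 10% (3 minutes) in this cycle.
print(adjust(30.0, 10.0))  # -> 27.0
```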
External Integrations¶
Webhooks¶
Register webhook endpoints to receive real-time notifications about Supervisor events:
- Health state changes
- Auto-restarts
- Model evictions and preloads
- Fallback activations
- Resource alerts
- Failure predictions
Each webhook delivery includes an HMAC-SHA256 signature (if a signing secret is configured) for payload verification.
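On the receiving side, verification means recomputing the HMAC-SHA256 digest over the raw request body with the shared secret and comparing it in constant time. The header name and hex encoding in this sketch are assumptions; check your delivery headers for the exact format:

```python
import hashlib
import hmac


def verify_signature(payload: bytes, secret: bytes, signature_hex: str) -> bool:
    """Verify an HMAC-SHA256 webhook signature on the receiving end.

    Assumes a hex-encoded digest of the raw request body; the actual
    encoding used by the Supervisor may differ.
    """
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information via timing differences
    return hmac.compare_digest(expected, signature_hex)


payload = b'{"event": "model.restarted", "model_id": "my-important-model"}'
secret = b"my-signing-secret"
sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(payload, secret, sig))  # True
```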
To add a webhook:
- Go to Settings > Supervisor.
- Scroll to the Webhooks section.
- Enter a name, URL, and select the event types you want to receive.
- Click Add Webhook.
You can test a webhook using the Test button to verify connectivity.
Prometheus Metrics¶
Enable the Prometheus metrics endpoint to expose Supervisor and model health data at /metrics for scraping.
Available metrics include:
- bgo_model_health_status — Per-model health (0=dead, 1=degraded, 2=healthy)
- bgo_model_request_total — Total requests per model
- bgo_model_inference_latency_seconds — Inference latency percentiles (p50, p95, p99)
- bgo_supervisor_decisions_total — Decision count by action type
- bgo_resource_gpu_memory_usage_ratio — GPU memory utilization
- bgo_resource_system_memory_usage_ratio — System RAM utilization
OpenTelemetry (OTLP)¶
For environments using OpenTelemetry collectors, enable OTLP export in the Supervisor configuration and specify your collector endpoint.
Lifecycle Events (SSE)¶
For headless or web clients, the Supervisor streams lifecycle events via Server-Sent Events (SSE) from the /api/v1/lifecycle/events endpoint.
Events include model loads/unloads, health changes, idle timeouts, auto-restarts, and resource pressure alerts. This allows external tools to react to system changes in real time without polling.
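Any SSE client library can consume the stream; as a sketch of the wire format, each frame is a block of `event:`/`data:` lines terminated by a blank line. The event name and payload below are illustrative, and a production client should also handle multi-line data fields and reconnection:

```python
def parse_sse(stream_text):
    """Parse Server-Sent Events frames into (event, data) pairs.

    Minimal parser for illustration: a blank line terminates each frame.
    """
    events = []
    event, data = None, []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # blank line terminates a frame
            if event or data:
                events.append((event, "\n".join(data)))
            event, data = None, []
    return events


sample = (
    "event: model.health_changed\n"
    'data: {"model_id": "chat-model", "state": "degraded"}\n'
    "\n"
)
print(parse_sse(sample))
```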
API Reference¶
Supervisor Endpoints¶
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/supervisor/status | Current supervisor status and statistics |
| GET | /api/v1/supervisor/config | Current configuration |
| PUT | /api/v1/supervisor/config | Update configuration |
| POST | /api/v1/supervisor/start | Start the supervisor |
| POST | /api/v1/supervisor/stop | Stop the supervisor |
| GET | /api/v1/supervisor/audit | Query audit log (supports from, to, modelId, limit params) |
| GET | /api/v1/supervisor/policies | List active policies |
| GET | /api/v1/supervisor/fallbacks | List fallback configurations |
| GET | /api/v1/supervisor/forecast | Usage forecasts |
| GET | /api/v1/supervisor/predictions | Failure predictions |
| GET | /api/v1/supervisor/tuning | Adaptive tuning status |
| GET | /api/v1/supervisor/webhooks | List registered webhooks |
| POST | /api/v1/supervisor/webhooks | Register a new webhook |
| DELETE | /api/v1/supervisor/webhooks/{id} | Remove a webhook |
| POST | /api/v1/supervisor/webhooks/{id}/test | Send a test delivery |
Lifecycle Endpoints¶
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/lifecycle/health | Health status of all models |
| GET | /api/v1/lifecycle/health/{model_id} | Health status of a specific model |
| GET | /api/v1/lifecycle/config | Lifecycle configuration |
| PUT | /api/v1/lifecycle/config | Update lifecycle configuration |
| POST | /api/v1/lifecycle/pin | Pin a model |
| POST | /api/v1/lifecycle/unpin | Unpin a model |
| GET | /api/v1/lifecycle/events | SSE stream of lifecycle events |
| GET | /metrics | Prometheus metrics endpoint |
Fail-Safe Design¶
The Supervisor is designed with a fail-open philosophy:
- Foundation hooks remain active: The base heartbeat and lifecycle hooks from the Foundation Layer are never disabled. If the Supervisor Agent itself crashes, these hooks continue to provide basic health monitoring and auto-restart.
- Agent self-heartbeat: The Supervisor emits its own heartbeat. If the frontend or Management API detects the Supervisor's heartbeat is lost, it falls back to the Foundation Layer's hook-based behavior.
- Continuum Router fallback: Even during Supervisor downtime, the Continuum Router's fallback model configuration ensures that API requests continue to be served.
This layered approach ensures that no single component failure can take down the entire system.