9.6. Remote vLLM Integration¶

While Backend.AI GO can run vLLM locally, you can also connect to a remote vLLM server. This is ideal if you have a centralized high-VRAM GPU server and want to use your laptop as a client.

Why Remote vLLM?¶

Resource Sharing: One GPU server can serve multiple team members.
Massive Models: Run Solar-Open-100B, gpt-oss-120B, or Qwen3.5-122B-A10B models that require multiple H100/H200 or next-gen B200/B300 GPUs, accessible from your MacBook Air.
Battery Life: Offload heavy computation to keep your laptop cool and long-lasting.

Setup Requirements¶

On your remote server, launch vLLM with the OpenAI-compatible server enabled:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-Instruct \
  --port 8000 \
  --api-key "my-secret-key"

Setup in Backend.AI GO¶

Go to Providers.
Click Add Provider and select vLLM.
Enter the Connection Details:
- Name: A friendly name (e.g., "Lab Server H100 Cluster").
- Base URL: http://<server-ip>:8000/v1
- API Key: my-secret-key (as set in the command above).
Connect: Backend.AI GO will verify the connection and list the loaded model(s).

Automatic Capability Detection¶

When Backend.AI GO connects to a vLLM server, it runs two background probes:

Tool-call probe: Sends one minimal request with tool_choice:"auto" to check whether the server was started with --enable-auto-tool-choice and --tool-call-parser. If it was, tool calling is enabled for that model. If not, tools are silently omitted so the server never receives a request it would reject with HTTP 400. The model's capability tooltip shows the functionalProbe detection signal once the probe completes.
Context-length probe: Reads max_model_len from the /v1/models list and uses that value to cap max_tokens on every chat request, preventing "maxtokens exceeds maxmodel_len" errors on small-context deployments.

Both probes are cached with the provider record and re-run on the next capability refresh. Manual capability overrides (via the model's settings panel) skip the probes entirely.

If your vLLM instance supports tool calling but the probe returns Unsupported, verify that the server was started with both flags. A vLLM started without --tool-call-parser will correctly report that it does not accept tool_choice:"auto".

Difference from OpenAI Compatible¶

While you could connect to vLLM using the "OpenAI Compatible" provider type, using the dedicated vLLM provider type offers advantages:

Optimized Handling: Backend.AI GO knows it's talking to vLLM and can handle specific tokenization or prompt formatting quirks better.
Metrics: Future updates may allow fetching server metrics (GPU usage, queue length) specifically from vLLM endpoints.

Connect to raw compute power anywhere.