Remote vLLM Integration¶
While Backend.AI GO can run vLLM locally, you can also connect to a remote vLLM server. This is ideal if you have a powerful centralized GPU server and want to use your laptop as a client.
Why Remote vLLM?¶
- Resource Sharing: One powerful server can serve multiple team members.
- Massive Models: Run Solar-Open-100B, gpt-oss-120B, or Qwen3-235B-A22B models that require multiple H100/H200 or next-generation B200/B300 GPUs, all accessible from your MacBook Air.
- Battery Life: Offload heavy computation to keep your laptop cool and its battery lasting longer.
Setup Requirements¶
On your remote server, launch vLLM with the OpenAI-compatible server enabled:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70b-Instruct \
  --port 8000 \
  --api-key "my-secret-key"
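Before configuring any client, you can confirm the endpoint is reachable from your laptop. This is a minimal check, assuming the command above is running on <server-ip> with the example API key; vLLM's OpenAI-compatible server lists its loaded models at /v1/models:

curl http://<server-ip>:8000/v1/models \
  -H "Authorization: Bearer my-secret-key"

A successful response is a JSON list of models; an authentication error usually means the key does not match the --api-key value used at launch.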
Setup in Backend.AI GO¶
- Go to Providers.
- Click Add Provider and select vLLM.
- Enter the Connection Details:
  - Name: A friendly name (e.g., "Lab Server H100 Cluster").
  - Base URL: http://<server-ip>:8000/v1
  - API Key: my-secret-key (as set in the command above).
- Connect: Backend.AI GO will verify the connection and list the loaded model(s). If verification fails, you can test the same details from a terminal, as sketched below.
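The following terminal check exercises the same Base URL and API Key that Backend.AI GO will use. It is a sketch only: it assumes the Llama 3 70B launch command above, and the model field must match whatever /v1/models reports on your server.

curl http://<server-ip>:8000/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3-70b-Instruct",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32
      }'

If this returns a chat completion, Backend.AI GO should connect with the same values.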
Difference from OpenAI Compatible¶
While you could connect to vLLM using the "OpenAI Compatible" provider type, the dedicated vLLM provider type offers advantages:
- Optimized Handling: Backend.AI GO knows it is talking to vLLM and can better handle vLLM-specific tokenization and prompt-formatting quirks.
- Metrics: Future updates may allow fetching server metrics (GPU usage, queue length) directly from vLLM endpoints.
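For reference, a vLLM server already exposes Prometheus-format metrics on its API port, which you can inspect yourself today. A rough sketch against the example server above (exact metric names vary by vLLM version, and whether /metrics requires the API key can depend on your deployment, so the header is included just in case):

curl -s http://<server-ip>:8000/metrics \
  -H "Authorization: Bearer my-secret-key" | grep "vllm:num_requests"

Typical output includes gauges such as vllm:num_requests_running and vllm:num_requests_waiting, which give a quick view of GPU load and queue length.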
Connect to raw compute power anywhere.