9.4. Building Apps with the API¶
Backend.AI GO is more than a standalone chat application—it's a complete AI API server that can power your own applications. By connecting your app to the Continuum Router, you can build tools that leverage both local models (for privacy and offline use) and cloud models (for cutting-edge capabilities) through a single, unified interface.
This guide shows you how to integrate Backend.AI GO's API into your applications using popular programming languages and frameworks.
What You Can Build¶
The Continuum Router's OpenAI-compatible API opens up a wide range of application possibilities:
- Privacy-first chatbots: Build customer support bots that keep sensitive conversations on-device using local models.
- Hybrid document analyzers: Process confidential documents locally with Llama, then route complex analytical queries to cloud models like GPT-5 or Claude.
- Content generation tools: Create writing assistants, code generators, or creative tools that give users control over which models to use.
- Research platforms: Build academic or data science tools that need to work offline or behind a firewall.
- Multi-device workflows: Run Backend.AI GO on your desktop and access models from your phone, tablet, or other devices on your local network.
How It Works¶

graph LR
A[Your Application] -->|OpenAI API| B[Continuum Router]
B -->|Route| C[Local Model<br/>llama-server]
B -->|Route| D[Cloud Model<br/>OpenAI/Anthropic/Gemini]
C -->|Response| B
D -->|Response| B
B -->|Response| A

- Your Application makes HTTP requests using the OpenAI API format (chat completions, models list, etc.).
- Continuum Router receives requests on its OpenAI-compatible endpoint and routes them to the appropriate backend.
- Requests are fulfilled by local models (via llama-server or mlx-server) or cloud models (via configured API providers).
- Responses flow back through the router in OpenAI API format.
Since the API is OpenAI-compatible, you can use the official OpenAI SDKs or any library that supports OpenAI, simply by changing the base_url parameter to point to your local Backend.AI GO instance.
Prerequisites¶
Before you begin, make sure you have:
- Backend.AI GO installed and running
- At least one model available—either a local model loaded in Backend.AI GO or a cloud provider configured (see Cloud Integration)
- A development environment with Python 3.9+, Node.js 18+, or a tool for making HTTP requests (curl, Postman, etc.)
Step 1: Enable the API¶

Backend.AI GO provides two ways to access the API:
Option A: Internal API (Same Machine)¶
The Internal API runs automatically when Backend.AI GO starts and is accessible only from your local machine. This is ideal for development and testing.
- Endpoint: http://localhost:8000/v1
- No configuration needed — it's always running by default
- No authentication required for local access
Option B: TCP Server (External Access)¶
The TCP Server allows other devices on your local network to access the API. Enable this if you want to:
- Access models from your phone, tablet, or other computers
- Deploy Backend.AI GO on a central server and connect from multiple client machines
- Build multi-device applications

To enable external access:
- Go to the API page in Backend.AI GO.
- Enable the TCP Server toggle.
- Note the port number displayed (default is 38080).
- Find your machine's IP address (e.g., 192.168.1.100).
- Use http://<your-ip>:38080/v1 as the base URL.
Security Tip
The TCP Server is designed for trusted local networks. Only enable it if you trust all devices on your network. For production deployments with external access, configure authentication in Settings > Network.
Port Configuration
The OpenAI-compatible API endpoint always uses port 8000 for internal access. The TCP Server uses a separate configurable port (default 38080) for external access. Both serve the same API.
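The two access modes differ only in host and port. As a sketch (the helper name and defaults are illustrative, using the default ports described on this page), the base URL your client should use can be built like this:

```python
def base_url(host: str = "localhost", tcp: bool = False, tcp_port: int = 38080) -> str:
    """Build the OpenAI-compatible base URL for a Backend.AI GO instance.

    The internal API listens on port 8000; the TCP server defaults to 38080.
    """
    port = tcp_port if tcp else 8000
    return f"http://{host}:{port}/v1"

# Internal API on the same machine
print(base_url())                           # http://localhost:8000/v1
# TCP server reached from another device on the LAN
print(base_url("192.168.1.100", tcp=True))  # http://192.168.1.100:38080/v1
```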
Step 2: Verify the API¶
Before integrating with your application, verify that the API is accessible and working.

List Available Models¶
curl http://localhost:8000/v1/models
Expected Response:
{
  "object": "list",
  "data": [
    {
      "id": "llama-3-8b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    },
    {
      "id": "gpt-5.1",
      "object": "model",
      "created": 1704067200,
      "owned_by": "openai"
    }
  ]
}
Test a Chat Completion¶
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
Expected Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}
If you see these responses, the API is working correctly!
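In scripts, it is often handy to pull just the model IDs out of the /v1/models response before picking one. A minimal sketch against the sample payload shown above (the parsing assumes the OpenAI-style list shape):

```python
import json

# Sample /v1/models payload in the shape shown above
models_response = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "llama-3-8b", "object": "model", "created": 1704067200, "owned_by": "local"},
    {"id": "gpt-5.1", "object": "model", "created": 1704067200, "owned_by": "openai"}
  ]
}
""")

def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style models list response."""
    return [m["id"] for m in payload.get("data", [])]

print(model_ids(models_response))  # ['llama-3-8b', 'gpt-5.1']
```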
Step 3: Building with Python¶
Python is one of the most popular languages for AI applications. Backend.AI GO works seamlessly with the official OpenAI Python SDK.
Install the SDK¶
pip install openai
Basic Integration¶
from openai import OpenAI

# Point the SDK to your local Backend.AI GO instance
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Authentication optional for local access
)

# Send a chat completion request
response = client.chat.completions.create(
    model="llama-3-8b",  # Use any model available in Backend.AI GO
    messages=[
        {"role": "user", "content": "Write a haiku about AI"}
    ]
)

print(response.choices[0].message.content)
Streaming Responses¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream the response token by token
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Function Calling¶
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Define functions the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-5.1",  # Function calling works best with cloud models
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        print(f"Model wants to call: {function_name}({arguments})")
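Once the model requests a tool call, your code has to run the matching function and feed the result back. A minimal dispatcher sketch (the get_weather stub is hypothetical; a real implementation would call an actual weather service):

```python
import json

def get_weather(location: str) -> str:
    # Hypothetical stub for illustration; replace with a real lookup
    return f"Sunny in {location}"

# Map tool names (as declared in the `tools` list) to Python callables
AVAILABLE_TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the function the model asked for, decoding its JSON arguments."""
    args = json.loads(arguments_json)
    return AVAILABLE_TOOLS[name](**args)

print(dispatch_tool_call("get_weather", '{"location": "Paris"}'))  # Sunny in Paris
```

The returned string would then be appended to the conversation as a "tool" role message so the model can compose its final answer.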
Building a Simple Chatbot¶
from openai import OpenAI

def chatbot():
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed"
    )
    messages = []
    print("Chatbot started. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="llama-3-8b",
            messages=messages
        )

        assistant_message = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"Bot: {assistant_message}\n")

if __name__ == "__main__":
    chatbot()
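The chatbot's messages list grows with every turn, and local models have limited context windows. One simple mitigation, sketched below with an arbitrary cutoff, is to keep only the most recent turns while preserving any leading system prompt:

```python
def trim_history(messages: list, max_turns: int = 20) -> list:
    """Keep the last `max_turns` messages, preserving a leading system prompt."""
    if messages and messages[0].get("role") == "system":
        return messages[:1] + messages[1:][-max_turns:]
    return messages[-max_turns:]

history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(30)]

trimmed = trim_history(history, max_turns=5)
print(len(trimmed))        # 6: system prompt + last 5 messages
print(trimmed[0]["role"])  # system
```

Calling trim_history on `messages` before each completion request keeps the prompt size bounded; production systems typically trim by token count rather than message count.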
Step 4: Building with JavaScript/TypeScript¶
For web applications and Node.js backends, use the OpenAI JavaScript SDK.
Install the SDK¶
npm install openai
Basic Integration¶
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed', // Authentication optional for local access
});

async function chat() {
  const response = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Write a haiku about AI' }
    ],
  });

  console.log(response.choices[0].message.content);
}

chat();
Streaming Responses¶
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

async function streamChat() {
  const stream = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Explain quantum computing in simple terms' }
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }
}

streamChat();
Building a Web API Server¶
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

app.post('/api/chat', async (req, res) => {
  try {
    const { message, model = 'llama-3-8b' } = req.body;

    const response = await client.chat.completions.create({
      model,
      messages: [{ role: 'user', content: message }],
    });

    res.json({
      success: true,
      response: response.choices[0].message.content,
    });
  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : 'Unknown error';
    res.status(500).json({
      success: false,
      error: errorMessage,
    });
  }
});

app.listen(3000, () => {
  console.log('Server running on http://localhost:3000');
});
Step 5: Building with curl/REST¶
For testing, automation scripts, or languages without an OpenAI SDK, you can use direct HTTP requests.
Chat Completion¶
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
Streaming¶
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }' \
  --no-buffer
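With "stream": true, the body arrives as server-sent events: each chunk is a `data:` line carrying a JSON delta in the OpenAI format, and the stream ends with `data: [DONE]`. A minimal sketch of decoding one such line:

```python
import json

def parse_sse_line(line: str):
    """Decode one SSE line from a streaming chat completion.

    Returns the chunk dict, or None for blank lines and the [DONE] sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

chunk = parse_sse_line('data: {"choices": [{"delta": {"content": "Once"}}]}')
print(chunk["choices"][0]["delta"]["content"])  # Once
print(parse_sse_line("data: [DONE]"))           # None
```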
List Models¶
curl http://localhost:8000/v1/models
Using with jq for JSON Parsing¶
# Get just the model response
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-8b",
"messages": [{"role": "user", "content": "Hi"}]
}' | jq -r '.choices[0].message.content'
# List all available models
curl -s http://localhost:8000/v1/models | jq '.data[].id'
Hybrid Workflows: Combining Local and Cloud Models¶

One of Backend.AI GO's most powerful features is the ability to mix local and cloud models in the same application. This enables hybrid workflows that balance privacy, cost, and capability.
Example: Privacy-First Content Moderation¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def moderate_and_respond(user_message: str) -> str:
    # Step 1: Check content locally for privacy
    moderation_prompt = f"""Analyze this message for policy violations:

Message: {user_message}

Respond with only 'SAFE' or 'UNSAFE'."""

    moderation = client.chat.completions.create(
        model="llama-3-8b",  # Use local model for private data
        messages=[{"role": "user", "content": moderation_prompt}]
    )

    if "UNSAFE" in moderation.choices[0].message.content:
        return "I cannot respond to that request."

    # Step 2: If safe, use cloud model for high-quality response
    response = client.chat.completions.create(
        model="gpt-5.1",  # Use cloud model for better quality
        messages=[{"role": "user", "content": user_message}]
    )

    return response.choices[0].message.content

# Usage
result = moderate_and_respond("Tell me about quantum computing")
print(result)
Example: Cost Optimization¶
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def smart_completion(prompt: str, complexity: str = "low") -> str:
    """
    Route requests to local or cloud models based on complexity.
    """
    if complexity == "low":
        # Use free local model for simple tasks
        model = "llama-3-8b"
    elif complexity == "medium":
        # Use fast cloud model for moderate complexity
        model = "gpt-5.1-mini"
    else:
        # Use powerful cloud model only when necessary
        model = "gpt-5.1"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Simple query - uses local model (free)
print(smart_completion("What is 2+2?", complexity="low"))

# Complex query - uses cloud model (paid)
print(smart_completion("Analyze the implications of quantum computing on cryptography", complexity="high"))
Example: Fallback Strategy¶
from openai import OpenAI
import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    timeout=httpx.Timeout(10.0, connect=5.0)  # 10s total, 5s connect
)

def resilient_completion(prompt: str) -> str:
    """
    Try local model first, fall back to cloud if unavailable.
    """
    models_to_try = [
        "llama-3-8b",  # Try local model first
        "gpt-5.1",     # Fall back to cloud if local fails
    ]

    for model in models_to_try:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Model {model} failed: {e}")
            continue

    return "All models unavailable. Please try again later."

result = resilient_completion("Explain neural networks")
print(result)
Troubleshooting¶
| Problem | Solution |
|---|---|
| Connection refused | Make sure Backend.AI GO is running. Check that the internal API is active (port 8000) or the TCP server is enabled for external access. |
| Model not found | Verify the model name matches exactly what's listed in Backend.AI GO's Models page. For local models, ensure the model is loaded. For cloud models, verify the provider is configured in Cloud Integration. |
| Slow responses | Local models depend on your hardware. Try a smaller model, enable GPU acceleration in Engine settings, or use a cloud model for faster responses. |
| Authentication errors | If you enabled authentication in Settings > Network, pass the configured API key in the Authorization header: Bearer your-api-key. For local access without authentication, set api_key="not-needed". |
| Streaming not working | Ensure your HTTP client supports streaming and doesn't buffer the response. In curl, use --no-buffer. In Python, iterate over the stream immediately. |
| External access not working | Verify the TCP server is enabled in the API page. Check that your firewall allows connections on the configured port. Use your machine's local IP address (e.g., 192.168.1.100), not localhost. |
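When authentication is enabled, raw HTTP clients need to attach the bearer token themselves. A small sketch of building request headers conditionally (the header shape follows the standard Authorization: Bearer scheme; the key value is a placeholder):

```python
def request_headers(api_key=None) -> dict:
    """Build request headers; add Authorization only when a key is configured."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers

print(request_headers())                # local access, no auth header
print(request_headers("your-api-key"))  # includes the Bearer token
```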
Related Pages¶
- Continuum Router & API — Technical details on the API gateway
- Cloud Integration — Set up cloud model providers
- Running Models — Load and manage local models
- Using Claude Code — Use Claude Code CLI with Backend.AI GO