
9.4. Building Apps with the API

Backend.AI GO is more than a standalone chat application—it's a complete AI API server that can power your own applications. By connecting your app to the Continuum Router, you can build tools that leverage both local models (for privacy and offline use) and cloud models (for cutting-edge capabilities) through a single, unified interface.

This guide shows you how to integrate Backend.AI GO's API into your applications using popular programming languages and frameworks.

What You Can Build

The Continuum Router's OpenAI-compatible API opens up a wide range of application possibilities:

  • Privacy-first chatbots: Build customer support bots that keep sensitive conversations on-device using local models.
  • Hybrid document analyzers: Process confidential documents locally with Llama, then route complex analytical queries to cloud models like GPT-5 or Claude.
  • Content generation tools: Create writing assistants, code generators, or creative tools that give users control over which models to use.
  • Research platforms: Build academic or data science tools that need to work offline or behind a firewall.
  • Multi-device workflows: Run Backend.AI GO on your desktop and access models from your phone, tablet, or other devices on your local network.

How It Works

API Routing

graph LR
    A[Your Application] -->|OpenAI API| B[Continuum Router]
    B -->|Route| C[Local Model<br/>llama-server]
    B -->|Route| D[Cloud Model<br/>OpenAI/Anthropic/Gemini]
    C -->|Response| B
    D -->|Response| B
    B -->|Response| A

  • Your Application makes HTTP requests using the OpenAI API format (chat completions, models list, etc.).
  • Continuum Router receives requests on its OpenAI-compatible endpoint and routes them to the appropriate backend.
  • Requests are fulfilled by local models (via llama-server or mlx-server) or cloud models (via configured API providers).
  • Responses flow back through the router in OpenAI API format.

Since the API is OpenAI-compatible, you can use the official OpenAI SDKs or any library that supports OpenAI, simply by changing the base_url parameter to point to your local Backend.AI GO instance.

Prerequisites

Before you begin, make sure you have:

  • Backend.AI GO installed and running
  • At least one model available—either a local model loaded in Backend.AI GO or a cloud provider configured (see Cloud Integration)
  • A development environment with Python 3.9+, Node.js 18+, or a tool for making HTTP requests (curl, Postman, etc.)

Step 1: Enable the API

API page - General

Backend.AI GO provides two ways to access the API:

Option A: Internal API (Same Machine)

The Internal API runs automatically when Backend.AI GO starts and is accessible only from your local machine. This is ideal for development and testing.

  • Endpoint: http://localhost:8000/v1
  • No configuration needed — it's always running by default
  • No authentication required for local access

Option B: TCP Server (External Access)

The TCP Server allows other devices on your local network to access the API. Enable this if you want to:

  • Access models from your phone, tablet, or other computers
  • Deploy Backend.AI GO on a central server and connect from multiple client machines
  • Build multi-device applications

TCP Server enable dialog

To enable external access:

  1. Go to the API page in Backend.AI GO.
  2. Enable the TCP Server toggle.
  3. Note the port number displayed (default is 38080).
  4. Find your machine's IP address (e.g., 192.168.1.100).
  5. Use http://<your-ip>:38080/v1 as the base URL.
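
Once the TCP Server is enabled, other devices on the same network can reach the API. For example, from another machine (substituting your machine's actual IP address for the 192.168.1.100 placeholder):

```shell
# Query the TCP Server from another device on the LAN
curl http://192.168.1.100:38080/v1/models
```

If this returns the same model list as the internal endpoint, external access is working.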

Security Tip

The TCP Server is designed for trusted local networks. Only enable it if you trust all devices on your network. For production deployments with external access, configure authentication in Settings > Network.

Port Configuration

The OpenAI-compatible API endpoint always uses port 8000 for internal access. The TCP Server uses a separate configurable port (default 38080) for external access. Both serve the same API.

Step 2: Verify the API

Before integrating with your application, verify that the API is accessible and working.

API Health & Timeouts

List Available Models

curl http://localhost:8000/v1/models

Expected Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3-8b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    },
    {
      "id": "gpt-5.1",
      "object": "model",
      "created": 1704067200,
      "owned_by": "openai"
    }
  ]
}

Test a Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Expected Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}

If you see these responses, the API is working correctly!

Step 3: Building with Python

Python is one of the most popular languages for AI applications. Backend.AI GO works seamlessly with the official OpenAI Python SDK.

Install the SDK

pip install openai

Basic Integration

from openai import OpenAI

# Point the SDK to your local Backend.AI GO instance
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Authentication optional for local access
)

# Send a chat completion request
response = client.chat.completions.create(
    model="llama-3-8b",  # Use any model available in Backend.AI GO
    messages=[
        {"role": "user", "content": "Write a haiku about AI"}
    ]
)

print(response.choices[0].message.content)

Streaming Responses

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream the response token by token
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Function Calling

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Define functions the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-5.1",  # Function calling works best with cloud models
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        print(f"Model wants to call: {function_name}({arguments})")
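
To complete the loop, your application executes the requested function and sends the result back as a "tool" message so the model can compose a final answer. The sketch below assumes a hypothetical get_weather implementation (the stub and the loop structure are illustrative, not part of Backend.AI GO):

```python
import json

# Hypothetical local implementation of the get_weather tool declared above.
def get_weather(location: str) -> str:
    # A real app would call a weather API here; this stub returns canned data.
    return f"Sunny, 22C in {location}"

def run_tool_loop(client, model, messages, tools):
    """Send messages, execute any requested tool calls, and ask the
    model for a final answer that incorporates the tool results."""
    response = client.chat.completions.create(
        model=model, messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content  # Model answered directly, no tool needed

    # Append the assistant turn, then one "tool" message per requested call
    messages.append(message)
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })

    # Second request: the model now sees the tool output and can answer
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```

Call run_tool_loop(client, "gpt-5.1", messages, tools) with the same messages and tools defined above.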

Building a Simple Chatbot

from openai import OpenAI

def chatbot():
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed"
    )

    messages = []
    print("Chatbot started. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="llama-3-8b",
            messages=messages
        )

        assistant_message = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_message})

        print(f"Bot: {assistant_message}\n")

if __name__ == "__main__":
    chatbot()

Step 4: Building with JavaScript/TypeScript

For web applications and Node.js backends, use the OpenAI JavaScript SDK.

Install the SDK

npm install openai

Basic Integration

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed', // Authentication optional for local access
});

async function chat() {
  const response = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Write a haiku about AI' }
    ],
  });

  console.log(response.choices[0].message.content);
}

chat();

Streaming

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

async function streamChat() {
  const stream = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Explain quantum computing in simple terms' }
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }
}

streamChat();

Building a Web API Server

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

app.post('/api/chat', async (req, res) => {
  try {
    const { message, model = 'llama-3-8b' } = req.body;

    const response = await client.chat.completions.create({
      model,
      messages: [{ role: 'user', content: message }],
    });

    res.json({
      success: true,
      response: response.choices[0].message.content,
    });
  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : 'Unknown error';
    res.status(500).json({
      success: false,
      error: errorMessage,
    });
  }
});

app.listen(3000, () => {
  console.log('Server running on http://localhost:3000');
});
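
With both Backend.AI GO and the Express server running, you can exercise the /api/chat endpoint with a quick curl call (request shape matches the handler above):

```shell
# Send a chat message through the Express proxy
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!", "model": "llama-3-8b"}'
```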

Step 5: Building with curl/REST

For testing, automation scripts, or languages without an OpenAI SDK, you can use direct HTTP requests.

Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }' \
  --no-buffer

List Models

curl http://localhost:8000/v1/models

Using with jq for JSON Parsing

# Get just the model response
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hi"}]
  }' | jq -r '.choices[0].message.content'

# List all available models
curl -s http://localhost:8000/v1/models | jq '.data[].id'
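
The two patterns combine naturally: let jq pick a model from /v1/models and feed it into the chat request. The script below is a sketch that assumes the router lists at least one model:

```shell
#!/bin/sh
# Pick the first model the router reports, then send it a prompt.
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}]}" \
  | jq -r '.choices[0].message.content'
```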

Hybrid Workflows: Combining Local and Cloud Models

API Mesh Network Topology

One of Backend.AI GO's most powerful features is the ability to mix local and cloud models in the same application. This enables hybrid workflows that balance privacy, cost, and capability.

Example: Privacy-First Content Moderation

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def moderate_and_respond(user_message: str) -> str:
    # Step 1: Check content locally for privacy
    moderation_prompt = f"""Analyze this message for policy violations:

    Message: {user_message}

    Respond with only 'SAFE' or 'UNSAFE'."""

    moderation = client.chat.completions.create(
        model="llama-3-8b",  # Use local model for private data
        messages=[{"role": "user", "content": moderation_prompt}]
    )

    if "UNSAFE" in moderation.choices[0].message.content:
        return "I cannot respond to that request."

    # Step 2: If safe, use cloud model for high-quality response
    response = client.chat.completions.create(
        model="gpt-5.1",  # Use cloud model for better quality
        messages=[{"role": "user", "content": user_message}]
    )

    return response.choices[0].message.content

# Usage
result = moderate_and_respond("Tell me about quantum computing")
print(result)

Example: Cost Optimization

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def smart_completion(prompt: str, complexity: str = "low") -> str:
    """
    Route requests to local or cloud models based on complexity.
    """
    if complexity == "low":
        # Use free local model for simple tasks
        model = "llama-3-8b"
    elif complexity == "medium":
        # Use fast cloud model for moderate complexity
        model = "gpt-5.1-mini"
    else:
        # Use powerful cloud model only when necessary
        model = "gpt-5.1"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Simple query - uses local model (free)
print(smart_completion("What is 2+2?", complexity="low"))

# Complex query - uses cloud model (paid)
print(smart_completion("Analyze the implications of quantum computing on cryptography", complexity="high"))

Example: Fallback Strategy

from openai import OpenAI
import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    timeout=httpx.Timeout(10.0, connect=5.0)  # 10s total, 5s connect
)

def resilient_completion(prompt: str) -> str:
    """
    Try local model first, fall back to cloud if unavailable.
    """
    models_to_try = [
        "llama-3-8b",      # Try local model first
        "gpt-5.1",         # Fall back to cloud if local fails
    ]

    for model in models_to_try:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Model {model} failed: {e}")
            continue

    return "All models unavailable. Please try again later."

result = resilient_completion("Explain neural networks")
print(result)

Troubleshooting

  • Connection refused: Make sure Backend.AI GO is running. Check that the internal API is active (port 8000) or that the TCP Server is enabled for external access.
  • Model not found: Verify the model name matches exactly what's listed on Backend.AI GO's Models page. For local models, ensure the model is loaded. For cloud models, verify the provider is configured in Cloud Integration.
  • Slow responses: Local model speed depends on your hardware. Try a smaller model, enable GPU acceleration in Engine settings, or use a cloud model for faster responses.
  • Authentication errors: If you enabled authentication in Settings > Network, pass the configured API key in the Authorization header as Bearer your-api-key. For local access without authentication, set api_key="not-needed".
  • Streaming not working: Ensure your HTTP client supports streaming and doesn't buffer the response. In curl, use --no-buffer. In Python, iterate over the stream immediately.
  • External access not working: Verify the TCP Server is enabled on the API page. Check that your firewall allows connections on the configured port. Use your machine's local IP address (e.g., 192.168.1.100), not localhost.