
9.4. Building Apps with the API

Backend.AI GO is more than a standalone chat application—it's a complete AI API server that can power your own applications. By connecting your app to the Continuum Router, you can build tools that leverage both local models (for privacy and offline use) and cloud models (for cutting-edge capabilities) through a single, unified interface.

This guide shows you how to integrate Backend.AI GO's API into your applications using popular programming languages and frameworks.

What You Can Build

The Continuum Router's OpenAI-compatible API opens up a wide range of application possibilities:

  • Privacy-first chatbots: Build customer support bots that keep sensitive conversations on-device using local models.
  • Hybrid document analyzers: Process confidential documents locally with Llama, then route complex analytical queries to cloud models like GPT-5 or Claude.
  • Content generation tools: Create writing assistants, code generators, or creative tools that give users control over which models to use.
  • Research platforms: Build academic or data science tools that need to work offline or behind a firewall.
  • Multi-device workflows: Run Backend.AI GO on your desktop and access models from your phone, tablet, or other devices on your local network.

How It Works

API Routing

graph LR
    A[Your Application] -->|OpenAI API| B[Continuum Router]
    B -->|Route| C[Local Model<br/>llama-server]
    B -->|Route| D[Cloud Model<br/>OpenAI/Anthropic/Gemini]
    C -->|Response| B
    D -->|Response| B
    B -->|Response| A

  • Your Application makes HTTP requests using the OpenAI API format (chat completions, models list, etc.).
  • Continuum Router receives requests on its OpenAI-compatible endpoint and routes them to the appropriate backend.
  • Requests are fulfilled by local models (via llama-server or mlx-server) or cloud models (via configured API providers).
  • Responses flow back through the router in OpenAI API format.

Since the API is OpenAI-compatible, you can use the official OpenAI SDKs or any library that supports OpenAI, simply by changing the base_url parameter to point to your local Backend.AI GO instance.

Prerequisites

Before you begin, make sure you have:

  • Backend.AI GO installed and running
  • At least one model available—either a local model loaded in Backend.AI GO or a cloud provider configured (see Cloud Integration)
  • A development environment with Python 3.9+, Node.js 18+, or a tool for making HTTP requests (curl, Postman, etc.)

Step 1: Enable the API

API page - General

Backend.AI GO provides two ways to access the API:

Option A: Internal API (Same Machine)

The Internal API runs automatically when Backend.AI GO starts and is accessible only from your local machine. This is ideal for development and testing.

  • Endpoint: http://localhost:8000/v1
  • No configuration needed — it's always running by default
  • No authentication required for local access

Option B: TCP Server (External Access)

The TCP Server allows other devices on your local network to access the API. Enable this if you want to:

  • Access models from your phone, tablet, or other computers
  • Deploy Backend.AI GO on a central server and connect from multiple client machines
  • Build multi-device applications

TCP Server enable dialog

To enable external access:

  1. Go to the API page in Backend.AI GO.
  2. Enable the TCP Server toggle.
  3. Note the port number displayed (default is 38080).
  4. Find your machine's IP address (e.g., 192.168.1.100).
  5. Use http://<your-ip>:38080/v1 as the base URL.
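
Once the TCP Server is enabled, other devices on the same network can reach the API. For example, from another machine (substituting your machine's actual IP address for the 192.168.1.100 placeholder):

```shell
# Query the TCP Server from another device on the LAN
curl http://192.168.1.100:38080/v1/models
```

If this returns the same model list as the internal endpoint, external access is working.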

Security Tip

The TCP Server is designed for trusted local networks. Only enable it if you trust all devices on your network. For production deployments with external access, configure authentication in Settings > Network.

Port Configuration

The OpenAI-compatible API endpoint always uses port 8000 for internal access. The TCP Server uses a separate configurable port (default 38080) for external access. Both serve the same API.

Step 2: Verify the API

Before integrating with your application, verify that the API is accessible and working.

API Health & Timeouts

List Available Models

curl http://localhost:8000/v1/models

Expected Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3-8b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    },
    {
      "id": "gpt-5.1",
      "object": "model",
      "created": 1704067200,
      "owned_by": "openai"
    }
  ]
}

Test a Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Expected Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}

If you see these responses, the API is working correctly!

Step 3: Building with Python

Python is one of the most popular languages for AI applications. Backend.AI GO works seamlessly with the official OpenAI Python SDK.

Install the SDK

pip install openai

Basic Integration

from openai import OpenAI

# Point the SDK to your local Backend.AI GO instance
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Authentication optional for local access
)

# Send a chat completion request
response = client.chat.completions.create(
    model="llama-3-8b",  # Use any model available in Backend.AI GO
    messages=[
        {"role": "user", "content": "Write a haiku about AI"}
    ]
)

print(response.choices[0].message.content)

Streaming Responses

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream the response token by token
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Function Calling

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Define functions the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-5.1",  # Function calling works best with cloud models
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools
)

# Check if the model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        print(f"Model wants to call: {function_name}({arguments})")
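
To complete the loop, your application executes the requested function and sends the result back as a "tool" message so the model can compose a final answer. The sketch below assumes a hypothetical get_weather implementation (the stub and the loop structure are illustrative, not part of Backend.AI GO):

```python
import json

# Hypothetical local implementation of the get_weather tool declared above.
def get_weather(location: str) -> str:
    # A real app would call a weather API here; this stub returns canned data.
    return f"Sunny, 22C in {location}"

def run_tool_loop(client, model, messages, tools):
    """Send messages, execute any requested tool calls, and ask the
    model for a final answer that incorporates the tool results."""
    response = client.chat.completions.create(
        model=model, messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content  # Model answered directly, no tool needed

    # Append the assistant turn, then one "tool" message per requested call
    messages.append(message)
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })

    # Second request: the model now sees the tool output and can answer
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```

Call run_tool_loop(client, "gpt-5.1", messages, tools) with the same messages and tools defined above.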

Building a Simple Chatbot

from openai import OpenAI

def chatbot():
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed"
    )

    messages = []
    print("Chatbot started. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="llama-3-8b",
            messages=messages
        )

        assistant_message = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_message})

        print(f"Bot: {assistant_message}\n")

if __name__ == "__main__":
    chatbot()

Step 4: Building with JavaScript/TypeScript

For web applications and Node.js backends, use the OpenAI JavaScript SDK.

Install the SDK

npm install openai

Basic Integration

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed', // Authentication optional for local access
});

async function chat() {
  const response = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Write a haiku about AI' }
    ],
  });

  console.log(response.choices[0].message.content);
}

chat();

Streaming

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

async function streamChat() {
  const stream = await client.chat.completions.create({
    model: 'llama-3-8b',
    messages: [
      { role: 'user', content: 'Explain quantum computing in simple terms' }
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    process.stdout.write(content);
  }
}

streamChat();

Building a Web API Server

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const client = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});

app.post('/api/chat', async (req, res) => {
  try {
    const { message, model = 'llama-3-8b' } = req.body;

    const response = await client.chat.completions.create({
      model,
      messages: [{ role: 'user', content: message }],
    });

    res.json({
      success: true,
      response: response.choices[0].message.content,
    });
  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : 'Unknown error';
    res.status(500).json({
      success: false,
      error: errorMessage,
    });
  }
});

app.listen(3000, () => {
  console.log('Server running on http://localhost:3000');
});
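
With both Backend.AI GO and the Express server running, you can exercise the /api/chat endpoint with a quick curl call (request shape matches the handler above):

```shell
# Send a chat message through the Express proxy
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!", "model": "llama-3-8b"}'
```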

Step 5: Building with curl/REST

For testing, automation scripts, or languages without an OpenAI SDK, you can use direct HTTP requests.

Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }' \
  --no-buffer

List Models

curl http://localhost:8000/v1/models

Using with jq for JSON Parsing

# Get just the model response
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hi"}]
  }' | jq -r '.choices[0].message.content'

# List all available models
curl -s http://localhost:8000/v1/models | jq '.data[].id'
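
The two patterns combine naturally: let jq pick a model from /v1/models and feed it into the chat request. The script below is a sketch that assumes the router lists at least one model:

```shell
#!/bin/sh
# Pick the first model the router reports, then send it a prompt.
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}]}" \
  | jq -r '.choices[0].message.content'
```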

Hybrid Workflows: Combining Local and Cloud Models

API Mesh Network Topology

One of Backend.AI GO's most powerful features is the ability to mix local and cloud models in the same application. This enables hybrid workflows that balance privacy, cost, and capability.

Example: Privacy-First Content Moderation

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def moderate_and_respond(user_message: str) -> str:
    # Step 1: Check content locally for privacy
    moderation_prompt = f"""Analyze this message for policy violations:

    Message: {user_message}

    Respond with only 'SAFE' or 'UNSAFE'."""

    moderation = client.chat.completions.create(
        model="llama-3-8b",  # Use local model for private data
        messages=[{"role": "user", "content": moderation_prompt}]
    )

    if "UNSAFE" in moderation.choices[0].message.content:
        return "I cannot respond to that request."

    # Step 2: If safe, use cloud model for high-quality response
    response = client.chat.completions.create(
        model="gpt-5.1",  # Use cloud model for better quality
        messages=[{"role": "user", "content": user_message}]
    )

    return response.choices[0].message.content

# Usage
result = moderate_and_respond("Tell me about quantum computing")
print(result)

Example: Cost Optimization

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

def smart_completion(prompt: str, complexity: str = "low") -> str:
    """
    Route requests to local or cloud models based on complexity.
    """
    if complexity == "low":
        # Use free local model for simple tasks
        model = "llama-3-8b"
    elif complexity == "medium":
        # Use fast cloud model for moderate complexity
        model = "gpt-5.1-mini"
    else:
        # Use powerful cloud model only when necessary
        model = "gpt-5.1"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Simple query - uses local model (free)
print(smart_completion("What is 2+2?", complexity="low"))

# Complex query - uses cloud model (paid)
print(smart_completion("Analyze the implications of quantum computing on cryptography", complexity="high"))

Example: Fallback Strategy

from openai import OpenAI
import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    timeout=httpx.Timeout(10.0, connect=5.0)  # 10s total, 5s connect
)

def resilient_completion(prompt: str) -> str:
    """
    Try local model first, fall back to cloud if unavailable.
    """
    models_to_try = [
        "llama-3-8b",      # Try local model first
        "gpt-5.1",         # Fall back to cloud if local fails
    ]

    for model in models_to_try:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Model {model} failed: {e}")
            continue

    return "All models unavailable. Please try again later."

result = resilient_completion("Explain neural networks")
print(result)

Troubleshooting

  • Connection refused: Make sure Backend.AI GO is running. Check that the internal API is active (port 8000) or that the TCP Server is enabled for external access.
  • Model not found: Verify the model name matches exactly what's listed on Backend.AI GO's Models page. For local models, ensure the model is loaded. For cloud models, verify the provider is configured in Cloud Integration.
  • Slow responses: Local model speed depends on your hardware. Try a smaller model, enable GPU acceleration in Engine settings, or use a cloud model for faster responses.
  • Authentication errors: If you enabled authentication in Settings > Network, pass the configured API key in the Authorization header as Bearer your-api-key. For local access without authentication, set api_key="not-needed".
  • Streaming not working: Ensure your HTTP client supports streaming and doesn't buffer the response. In curl, use --no-buffer. In Python, iterate over the stream immediately.
  • External access not working: Verify the TCP Server is enabled on the API page. Check that your firewall allows connections on the configured port. Use your machine's local IP address (e.g., 192.168.1.100), not localhost.