Model Settings & Parameters

Backend.AI GO offers a wide range of configuration options to fine-tune how models are loaded and how they generate text. This page details all available settings.

Loading Parameters

These settings apply when you load a model into memory. They determine hardware usage and basic model capabilities.

Hardware Acceleration

  • GPU Layers: Determines how many layers of the model are offloaded to your GPU.
    • Max (All): Recommended for best performance if VRAM allows.
    • Partial: Use if you have limited VRAM. The rest runs on CPU.
    • 0: Runs entirely on CPU (slower).
  • Main GPU: If you have multiple GPUs, selects which one to use as the primary device.
  • Split Mode: For multi-GPU setups, determines how the model is split across devices (e.g., Row Split, Layer Split).
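
These loading options correspond closely to standard llama.cpp loader settings. As an illustration only, assuming a llama.cpp-based engine (Backend.AI GO's internals may differ), the equivalent configuration with the llama-cpp-python bindings looks like this:

    from llama_cpp import Llama, LLAMA_SPLIT_MODE_LAYER

    # Hypothetical model path; n_gpu_layers=-1 offloads all layers ("Max (All)").
    llm = Llama(
        model_path="./models/example-7b.gguf",
        n_gpu_layers=-1,                    # 0 = CPU only; a positive int = partial offload
        main_gpu=0,                         # primary device index in a multi-GPU setup
        split_mode=LLAMA_SPLIT_MODE_LAYER,  # or LLAMA_SPLIT_MODE_ROW for row split
    )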

Memory & Context

  • Context Length: The maximum amount of text (tokens) the model can remember in a single conversation.
    • Note: Higher context requires more VRAM/RAM.
    • Default: Usually 2048 or 4096 depending on the model.
  • Batch Size: The number of tokens processed in parallel during prompt evaluation. Higher values speed up processing long prompts but use more VRAM.
  • Flash Attention: Enables a memory-efficient attention mechanism (requires compatible hardware).
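
Continuing the same hypothetical llama-cpp-python sketch, context length, batch size, and flash attention are all fixed at load time rather than per request:

    # Larger n_ctx and n_batch trade VRAM/RAM for capacity and prompt speed.
    llm = Llama(
        model_path="./models/example-7b.gguf",
        n_ctx=4096,       # maximum tokens the model can attend to in one conversation
        n_batch=512,      # tokens evaluated in parallel during prompt processing
        flash_attn=True,  # memory-efficient attention; needs compatible hardware
    )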

Performance

  • Threads: The number of CPU threads to use for inference (relevant when not fully offloaded to GPU).
  • NUMA Support: Optimizations for multi-socket CPU systems.
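
These CPU-side settings matter most when few or no layers are offloaded. In the same hypothetical sketch:

    # With n_gpu_layers=0 the model runs fully on CPU, so thread count dominates.
    llm = Llama(
        model_path="./models/example-7b.gguf",
        n_gpu_layers=0,
        n_threads=8,   # often best set near the number of physical cores
        numa=True,     # NUMA optimizations for multi-socket systems
    )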

Generation Parameters

These settings (often found in the "Parameters" drawer during chat) control the creativity and behavior of the model's responses.

Creativity & Randomness

  • Temperature: Controls the randomness of the output.
    • Low (0.1 - 0.5): Focused, near-deterministic, logical. Good for coding and factual tasks.
    • High (0.8 - 1.5): Creative, unpredictable. Good for storytelling.
  • Top P (Nucleus Sampling): Restricts the next-token selection to the smallest set of tokens whose cumulative probability reaches P.
  • Top K: Limits the next token selection to the top K most probable tokens.
  • Min P: Discards tokens whose probability falls below a threshold set relative to the most probable token.
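
Unlike the loading parameters, these sampling controls apply per request. A minimal sketch, continuing the hypothetical llama-cpp-python setup from above:

    # llm = Llama(...) as constructed in the loading examples above.
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a haiku about autumn."}],
        temperature=0.9,  # higher = more varied, creative output
        top_p=0.95,       # keep the smallest token set with cumulative probability >= 0.95
        top_k=40,         # consider only the 40 most probable tokens
        min_p=0.05,       # drop tokens below 5% of the top token's probability
    )
    print(out["choices"][0]["message"]["content"])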

Repetition Control

  • Repeat Penalty: Penalizes tokens that have recently appeared in the output to prevent looping.
  • Presence Penalty: Applies a flat penalty to any token that has already appeared, regardless of count.
  • Frequency Penalty: Applies a penalty that grows with the number of times a token has appeared.
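
In the same hypothetical sketch, the three penalties can be combined on a single request; values around 1.1 for repeat penalty and 0.0 to 1.0 for the other two are typical starting points:

    # llm = Llama(...) as constructed in the loading examples above.
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "List ten uses for a paperclip."}],
        repeat_penalty=1.1,     # >1.0 discourages recently generated tokens
        presence_penalty=0.5,   # flat penalty once a token has appeared at all
        frequency_penalty=0.5,  # penalty grows with each additional occurrence
    )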

Structural Control

  • System Prompt: A high-level instruction that defines the model's persona and constraints (e.g., "You are a helpful coding assistant").
  • Stop Strings: Specific sequences of text that will cause the model to stop generating immediately.
  • Max Tokens: The hard limit on the number of tokens the model can generate in a single response.
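
These structural controls round out the same hypothetical request: the system prompt travels as the first message, while stop strings and the token cap are plain parameters:

    # llm = Llama(...) as constructed in the loading examples above.
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Explain list comprehensions in one paragraph."},
        ],
        stop=["###", "\n\n\n"],  # generation halts as soon as any sequence appears
        max_tokens=256,          # hard limit on generated tokens
    )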

Reasoning Effort

For models that support extended thinking (reasoning models), you can control the depth of reasoning:

  • Default Reasoning Effort: Configure in Settings > Inference. Options include:
    • None (Off): No extended thinking
    • Low / Medium / High / Extra High: Progressively deeper reasoning
    • Last Used: Remembers and applies the most recently used setting
  • Per-Session Override: Adjust reasoning effort directly in the chat interface for individual sessions.
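
If Backend.AI GO serves models over an OpenAI-compatible endpoint (an assumption; the URL and the reasoning_effort field below are hypothetical, so verify against your server's API reference), a per-session override might look like this sketch:

    import requests

    # Hypothetical endpoint and field names; verify against the actual API docs.
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "example-reasoning-model",
            "messages": [{"role": "user", "content": "Why is the sky blue?"}],
            "reasoning_effort": "high",  # e.g. "low" | "medium" | "high"
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])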