Model Settings & Parameters¶
Backend.AI GO offers a wide range of configuration options to fine-tune how models are loaded and how they generate text. This page details all available settings.
Loading Parameters¶
These settings apply when you load a model into memory. They determine hardware usage and basic model capabilities.
Hardware Acceleration¶
- GPU Layers: Determines how many layers of the model are offloaded to your GPU.
- Max (All): Recommended for best performance if VRAM allows.
- Partial: Use if VRAM is limited; the remaining layers run on the CPU.
- 0: Runs entirely on CPU (slower).
- Main GPU: If you have multiple GPUs, selects which one to use as the primary device.
- Split Mode: For multi-GPU setups, determines how the model is split across devices (e.g., Row Split, Layer Split).
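These loader settings map onto the options exposed by llama.cpp-style backends. As a rough illustration (assuming the llama-cpp-python bindings here, which may differ from what GO uses internally):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",  # hypothetical model file
    n_gpu_layers=-1,  # "Max (All)": offload every layer; 0 = CPU only
    main_gpu=0,       # primary device index in a multi-GPU setup
    split_mode=1,     # llama.cpp split-mode enum: 1 = layer split, 2 = row split
)
```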
Memory & Context¶
- Context Length: The maximum number of tokens the model can attend to in a single conversation (prompt plus generated response).
- Note: Higher context requires more VRAM/RAM.
- Default: Usually 2048 or 4096 depending on the model.
- Batch Size: The number of tokens processed in parallel during prompt evaluation. Higher values speed up processing long prompts but use more VRAM.
- Flash Attention: Enables a memory-efficient attention mechanism (requires compatible hardware).
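Most of the memory cost of a longer context comes from the KV cache, which grows linearly with context length. A back-of-the-envelope estimate for a hypothetical 7B Llama-style model (the architecture numbers below are illustrative assumptions; check your model's metadata for real values):

```python
# K and V are cached per layer, per head, per token.
n_layers, n_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2  # fp16 cache
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token of context

for ctx in (2048, 4096, 8192):
    print(ctx, per_token * ctx / 2**30, "GiB")  # 1.0, 2.0, 4.0 GiB
```

Doubling the context length doubles this cost, on top of the model weights themselves.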
Performance¶
- Threads: The number of CPU threads to use for inference (relevant when not fully offloaded to GPU).
- NUMA Support: Optimizations for multi-socket CPU systems.
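If you tune Threads manually, a common heuristic is to match the number of physical cores rather than logical ones, since SMT siblings rarely speed up token generation. A small sketch (the halving assumes SMT is enabled; adjust for your CPU):

```python
import os

logical_cores = os.cpu_count() or 1     # logical cores (includes SMT siblings)
n_threads = max(1, logical_cores // 2)  # heuristic: ~physical core count
```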
Generation Parameters¶
These settings (often found in the "Parameters" drawer during chat) control the creativity and behavior of the model's responses.
Creativity & Randomness¶
- Temperature: Controls the randomness of the output.
- Low (0.1 - 0.5): Focused, near-deterministic, logical. Good for coding and factual tasks.
- High (0.8 - 1.5): Creative, unpredictable. Good for storytelling.
- Top P (Nucleus Sampling): Limits the next token selection to the smallest set of tokens whose cumulative probability exceeds the threshold P.
- Top K: Limits the next token selection to the top K most probable tokens.
- Min P: Discards tokens whose probability falls below a set fraction of the most probable token's probability. (The sketch below shows how these filters combine.)
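These four controls act on the same next-token probability distribution and are usually combined. A minimal numpy sketch of how they interact (a simplification: real engines such as llama.cpp chain the filters in a configurable order and renormalize between steps):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05):
    # Temperature: rescale logits; lower sharpens the distribution,
    # higher flattens it toward uniform randomness.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)

    # Top K: keep only the K highest-probability tokens.
    if 0 < top_k < len(probs):
        keep &= probs >= np.sort(probs)[-top_k]

    # Min P: drop tokens below min_p times the best token's probability.
    keep &= probs >= min_p * probs.max()

    # Top P (nucleus): keep the smallest set whose cumulative probability
    # reaches top_p; the single most probable token is always kept.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = np.zeros_like(keep)
    nucleus[order[cumulative - probs[order] <= top_p]] = True
    keep &= nucleus

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```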
Repetition Control¶
- Repeat Penalty: Penalizes tokens that have already appeared in the recent context to prevent the model from looping.
- Presence Penalty: Penalizes tokens based on whether they have appeared at all.
- Frequency Penalty: Penalizes tokens based on how many times they have appeared.
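A rough sketch of how the three penalties adjust next-token logits. The repeat-penalty formula follows the llama.cpp/CTRL convention (divide positive logits, multiply negative ones); presence and frequency penalties use the OpenAI-style additive form. Exact behavior varies by engine:

```python
import numpy as np

def apply_penalties(logits, generated_ids,
                    repeat_penalty=1.1, presence_penalty=0.0, frequency_penalty=0.0):
    ids, counts = np.unique(np.asarray(generated_ids), return_counts=True)
    seen = logits[ids]

    # Repeat Penalty: push already-seen tokens toward lower probability.
    logits[ids] = np.where(seen > 0, seen / repeat_penalty, seen * repeat_penalty)

    # Presence Penalty: flat subtraction for any token that appeared at all.
    logits[ids] -= presence_penalty

    # Frequency Penalty: subtraction scaled by how often each token appeared.
    logits[ids] -= frequency_penalty * counts
    return logits
```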
Structural Control¶
- System Prompt: A high-level instruction that defines the model's persona and constraints (e.g., "You are a helpful coding assistant").
- Stop Strings: Specific sequences of text that will cause the model to stop generating immediately.
- Max Tokens: The hard limit on the number of tokens the model can generate in a single response.
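To see how the structural controls fit together, here is a hypothetical decode loop; `model.next_token` is an invented stand-in for a single decoding step, not a real GO API:

```python
def generate(model, prompt, max_tokens=512, stop=("\nUser:", "</s>")):
    # `prompt` would typically begin with the System Prompt,
    # followed by the user's message.
    text = ""
    for _ in range(max_tokens):        # Max Tokens: hard cap on response length
        text += model.next_token(prompt + text)
        for s in stop:                 # Stop Strings: halt and trim on first match
            if s in text:
                return text[: text.index(s)]
    return text
```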
Reasoning Effort¶
For models that support extended thinking (reasoning models), you can control the depth of reasoning:
- Default Reasoning Effort: Configure in Settings > Inference. Options include:
- None (Off): No extended thinking
- Low / Medium / High / Extra High: Progressively deeper reasoning
- Last Used: Remembers and applies the most recently used setting
- Per-Session Override: Adjust reasoning effort directly in the chat interface for individual sessions.