Model Settings & Parameters¶
Backend.AI GO offers a wide range of configuration options to fine-tune how models are loaded and how they generate text. This page details all available settings.
Loading Parameters¶
These settings apply when you load a model into memory. They determine hardware usage and basic model capabilities.
Hardware Acceleration¶
- GPU Layers: Determines how many layers of the model are offloaded to your GPU.
- Max (All): Recommended for best performance if VRAM allows.
- Partial: Use if VRAM is limited; the remaining layers run on the CPU.
- 0: Runs entirely on CPU (slower).
- Main GPU: If you have multiple GPUs, selects which one to use as the primary device.
- Split Mode: For multi-GPU setups, determines how the model is split across devices (e.g., Row Split, Layer Split).
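These loader settings map onto the options exposed by llama.cpp-style backends. As a rough illustration (assuming the llama-cpp-python bindings here, which may differ from what GO uses internally):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",  # hypothetical model file
    n_gpu_layers=-1,  # "Max (All)": offload every layer; 0 = CPU only
    main_gpu=0,       # primary device index in a multi-GPU setup
    split_mode=1,     # llama.cpp split-mode enum: 1 = layer split, 2 = row split
)
```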
Memory & Context¶
- Context Length: The maximum number of tokens the model can attend to in a single conversation (prompt plus generated response).
- Note: Higher context requires more VRAM/RAM.
- Default: Usually 2048 or 4096 depending on the model.
- Batch Size: The number of tokens processed in parallel during prompt evaluation. Higher values speed up processing long prompts but use more VRAM.
- Flash Attention: Enables a memory-efficient attention mechanism (requires compatible hardware).
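Most of the memory cost of a longer context comes from the KV cache, which grows linearly with context length. A back-of-the-envelope estimate for a hypothetical 7B Llama-style model (the architecture numbers below are illustrative assumptions; check your model's metadata for real values):

```python
# K and V are cached per layer, per head, per token.
n_layers, n_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2  # fp16 cache
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token of context

for ctx in (2048, 4096, 8192):
    print(ctx, per_token * ctx / 2**30, "GiB")  # 1.0, 2.0, 4.0 GiB
```

Doubling the context length doubles this cost, on top of the model weights themselves.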
Performance¶
- Threads: The number of CPU threads to use for inference (relevant when not fully offloaded to GPU).
- NUMA Support: Optimizations for multi-socket CPU systems.
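If you tune Threads manually, a common heuristic is to match the number of physical cores rather than logical ones, since SMT siblings rarely speed up token generation. A small sketch (the halving assumes SMT is enabled; adjust for your CPU):

```python
import os

logical_cores = os.cpu_count() or 1     # logical cores (includes SMT siblings)
n_threads = max(1, logical_cores // 2)  # heuristic: ~physical core count
```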
Generation Parameters¶
These settings (often found in the "Parameters" drawer during chat) control the creativity and behavior of the model's responses.
Creativity & Randomness¶
- Temperature: Controls the randomness of the output.
- Low (0.1 - 0.5): Focused, near-deterministic, logical. Good for coding and factual tasks.
- High (0.8 - 1.5): Creative, unpredictable. Good for storytelling.
- Top P (Nucleus Sampling): Limits the next token selection to the smallest set of tokens whose cumulative probability exceeds the threshold P.
- Top K: Limits the next token selection to the top K most probable tokens.
- Min P: Discards tokens whose probability falls below a set fraction of the most probable token's probability. (The sketch below shows how these filters combine.)
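These four controls act on the same next-token probability distribution and are usually combined. A minimal numpy sketch of how they interact (a simplification: real engines such as llama.cpp chain the filters in a configurable order and renormalize between steps):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05):
    # Temperature: rescale logits; lower sharpens the distribution,
    # higher flattens it toward uniform randomness.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)

    # Top K: keep only the K highest-probability tokens.
    if 0 < top_k < len(probs):
        keep &= probs >= np.sort(probs)[-top_k]

    # Min P: drop tokens below min_p times the best token's probability.
    keep &= probs >= min_p * probs.max()

    # Top P (nucleus): keep the smallest set whose cumulative probability
    # reaches top_p; the single most probable token is always kept.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    nucleus = np.zeros_like(keep)
    nucleus[order[cumulative - probs[order] <= top_p]] = True
    keep &= nucleus

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```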
Repetition Control¶
- Repeat Penalty: Penalizes tokens that have already appeared in the recent context to prevent the model from looping.
- Presence Penalty: Penalizes tokens based on whether they have appeared at all.
- Frequency Penalty: Penalizes tokens based on how many times they have appeared.
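A rough sketch of how the three penalties adjust next-token logits. The repeat-penalty formula follows the llama.cpp/CTRL convention (divide positive logits, multiply negative ones); presence and frequency penalties use the OpenAI-style additive form. Exact behavior varies by engine:

```python
import numpy as np

def apply_penalties(logits, generated_ids,
                    repeat_penalty=1.1, presence_penalty=0.0, frequency_penalty=0.0):
    ids, counts = np.unique(np.asarray(generated_ids), return_counts=True)
    seen = logits[ids]

    # Repeat Penalty: push already-seen tokens toward lower probability.
    logits[ids] = np.where(seen > 0, seen / repeat_penalty, seen * repeat_penalty)

    # Presence Penalty: flat subtraction for any token that appeared at all.
    logits[ids] -= presence_penalty

    # Frequency Penalty: subtraction scaled by how often each token appeared.
    logits[ids] -= frequency_penalty * counts
    return logits
```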
Structural Control¶
- System Prompt: A high-level instruction that defines the model's persona and constraints (e.g., "You are a helpful coding assistant").
- Stop Strings: Specific sequences of text that will cause the model to stop generating immediately.
- Max Tokens: The hard limit on the number of tokens the model can generate in a single response.
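To see how the structural controls fit together, here is a hypothetical decode loop; `model.next_token` is an invented stand-in for a single decoding step, not a real GO API:

```python
def generate(model, prompt, max_tokens=512, stop=("\nUser:", "</s>")):
    # `prompt` would typically begin with the System Prompt,
    # followed by the user's message.
    text = ""
    for _ in range(max_tokens):        # Max Tokens: hard cap on response length
        text += model.next_token(prompt + text)
        for s in stop:                 # Stop Strings: halt and trim on first match
            if s in text:
                return text[: text.index(s)]
    return text
```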
Reasoning Effort¶
For models that support extended thinking (reasoning models), you can control the depth of reasoning:
- Default Reasoning Effort: Configure in Settings > Inference. Options include:
- None (Off): No extended thinking
- Low / Medium / High / Extra High: Progressively deeper reasoning
- Last Used: Remembers and applies the most recently used setting
- Per-Session Override: Adjust reasoning effort directly in the chat interface for individual sessions.