8.3. Benchmarking¶
Backend.AI GO includes a powerful benchmarking suite based on llama-bench. This allows you to objectively measure the performance of your hardware and compare different models or quantization levels.
Why Benchmark?¶
- Hardware Check: Verify that GPU acceleration (Metal/CUDA) is working correctly.
- Model Selection: Decide which quantization level (e.g., Q4 vs. Q8) offers the best balance of speed and quality on your machine.
- Performance Tracking: Monitor performance changes after hardware upgrades or driver updates.
Running a Benchmark¶
Navigate to the Models tab, select a model, and click the Benchmark icon (speedometer) to open the benchmarking tool.
1. Quick Test¶

Runs a short, standard test to give you immediate feedback.
- Settings: Uses default values (e.g., 512 prompt tokens, 128 generation tokens).
- Use Case: A quick sanity check to see whether a model runs at a usable speed.
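Because the suite is built on llama-bench, the Quick Test corresponds roughly to the invocation sketched below. This is an illustration only, assuming a local llama-bench binary from llama.cpp on your PATH and a hypothetical model path; Backend.AI GO runs the equivalent for you.

```python
import subprocess

# Rough CLI equivalent of the Quick Test (assumes llama-bench is on PATH;
# the model path below is a hypothetical example).
result = subprocess.run(
    [
        "llama-bench",
        "-m", "models/llama-3.2-1b-q4_k_m.gguf",  # hypothetical model file
        "-p", "512",   # prompt tokens (default Quick Test value)
        "-n", "128",   # generation tokens (default Quick Test value)
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # llama-bench prints a Markdown results table by default
```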
2. Full Suite (Advanced)¶

Allows you to configure detailed parameters for a comprehensive stress test.
| Parameter | Description | Recommended Value |
|---|---|---|
| Prompt Tokens (PP) | The amount of text fed into the model to simulate reading. | 512, 1024, 4096 |
| Gen Tokens (TG) | The amount of text the model generates. | 128, 256 |
| Batch Size | How many sequences to process in parallel. | 1 (for chat), 512+ (for batch processing) |
| Repetitions | How many times to repeat the test for statistical accuracy. | 5 or more |
| GPU Layers | How many layers to offload to the GPU. | -1 (All) for best performance |
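For reference, a comparable sweep can be expressed directly against llama-bench, which accepts comma-separated value lists and runs every combination. A minimal sketch, assuming a local llama-bench binary and a hypothetical model path; note that `-ngl 99` is the common way to offload all layers, matching the app's "-1 (All)" setting.

```python
import subprocess

# Full Suite sketch: llama-bench runs every combination of the listed values.
cmd = [
    "llama-bench",
    "-m", "models/qwen2.5-7b-q4_k_m.gguf",  # hypothetical model file
    "-p", "512,1024,4096",  # Prompt Tokens (PP)
    "-n", "128,256",        # Gen Tokens (TG)
    "-b", "512",            # Batch Size
    "-r", "5",              # Repetitions per configuration
    "-ngl", "99",           # offload all layers (the app's "-1 (All)")
]
subprocess.run(cmd, check=True)
```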
Understanding Results¶
The benchmark provides several key metrics.
Key Metrics¶
| Metric | Full Name | Meaning | Good Range (Example) |
|---|---|---|---|
| TPS | Tokens Per Second | The overall speed of the model. Higher is better. | > 10 t/s (readable); > 50 t/s (fast) |
| PP Speed | Prompt Processing | How fast the model "reads" your input. Important for RAG or summarizing long documents. | > 100 t/s (Apple M1); > 1000 t/s (RTX 4090) |
| TG Speed | Text Generation | How fast the model "writes" the response. This determines how smooth the chat feels. | > 20 t/s is ideal for chat. |
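All of these speeds are simply token counts divided by wall-clock time. A minimal sketch of the arithmetic, using the thresholds from the table above (the sample numbers are illustrative):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second."""
    return n_tokens / elapsed_s

# Example: 128 generated tokens in 4.1 seconds (illustrative numbers).
tg = tokens_per_second(128, 4.1)   # ~31 t/s
if tg >= 20:
    print(f"{tg:.1f} t/s: smooth enough for interactive chat")
elif tg >= 10:
    print(f"{tg:.1f} t/s: readable, but noticeably slow")
else:
    print(f"{tg:.1f} t/s: below comfortable reading speed")
```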
Expected Performance¶
Performance varies widely with hardware and model size (parameter count).
| Hardware | Model Size | Expected TG Speed |
|---|---|---|
| Apple M5 | 1.7B (Q4) | ~100 t/s |
| Apple M4 | 1.7B (Q4) | ~75 t/s |
| Apple M4 Max | 7B (Q4) | ~110 t/s |
| Apple M5 | 32B (Q4) | ~0.62 t/s |
| NVIDIA RTX 5090 | 70B (Q4) | ~45 t/s |
| NVIDIA RTX 4090 | 70B (Q4) | ~25 t/s |
| NVIDIA RTX 3060 | 7B (Q4) | ~50 t/s |
| CPU Only (Modern) | 7B (Q4) | ~2-5 t/s (very slow) |
* Note: The values above are approximate examples for reference and may vary significantly based on specific hardware configurations, background processes, and thermal conditions.
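For a rough sanity check of your own numbers: single-stream text generation is usually memory-bandwidth bound, so an upper bound on TG speed is approximately memory bandwidth divided by the model's size in memory. A back-of-the-envelope sketch; the bandwidth figure is an approximate published spec, not a measurement:

```python
def estimated_tg_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on text-generation speed (t/s) for a dense model:
    each generated token reads roughly the whole model from memory."""
    return bandwidth_gb_s / model_size_gb

# Approximate figures: a 7B Q4 model occupies ~4.5 GB; an M4 Max has
# ~546 GB/s of memory bandwidth (published spec).
print(f"~{estimated_tg_ceiling(546, 4.5):.0f} t/s ceiling")  # ~121 t/s,
# in line with the ~110 t/s shown in the table above.
```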
Comparing & History¶
Backend.AI GO automatically saves your benchmark runs.
- History Tab: View past results to track performance over time.
- Comparison: Select multiple runs to see a side-by-side comparison table and charts. This is perfect for visualizing the trade-off between model size and speed.
3. Hardware Profile¶

Finds optimal hardware settings by testing multiple configurations. You can vary:
- Thread counts: Test different CPU thread counts (1, 2, 4, 8, 16) to find the optimal value.
- GPU layer %: Test different percentages of model layers offloaded to the GPU (0%, 25%, 50%, 75%, 100%).
- Flash Attention: Toggle Flash Attention on and off to compare its impact on performance.
The estimated duration is shown before starting, so you can plan accordingly.
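If you want to reproduce a similar sweep outside the app, llama-bench can vary threads, GPU layers, and Flash Attention in one run. A sketch assuming a hypothetical model with 32 layers; the percentage-to-layer conversion mirrors what the profile does conceptually, and you should confirm the exact flags with `llama-bench --help` on your build:

```python
import subprocess

N_LAYERS = 32  # assumed layer count of the test model (model-dependent)

# Convert the profile's GPU-layer percentages into absolute layer counts.
pcts = [0, 25, 50, 75, 100]
ngl_values = sorted({round(p / 100 * N_LAYERS) for p in pcts})

cmd = [
    "llama-bench",
    "-m", "models/qwen2.5-7b-q4_k_m.gguf",         # hypothetical model file
    "-t", "1,2,4,8,16",                            # thread counts to test
    "-ngl", ",".join(str(n) for n in ngl_values),  # 0,8,16,24,32
    "-fa", "0,1",                                  # Flash Attention off/on
]
subprocess.run(cmd, check=True)
```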
Model Comparison Wizard¶
The Model Comparison Wizard provides a guided, step-by-step workflow for comparing multiple models under identical test conditions.
How to Use¶
- Open the Wizard: Navigate to the Benchmark section and click the Compare Models button to launch the wizard.
- Step 1 - Select Models: Choose 2 to 4 models you want to compare. A specification preview table shows the selected models' key details (file size, quantization, context length).
- Step 2 - Configure Parameters: Set the test parameters that will be applied uniformly to all selected models:
  - Context size: Number of prompt tokens (256 to 4096)
  - Generation length: Number of tokens to generate (32 to 256)
  - Repetitions: Number of test runs for statistical accuracy (1 to 5)
- Step 3 - View Results: After running the comparison, view side-by-side results including:
  - Specification Table: File size, quantization, format, and performance metrics (PP/TG speed)
  - Performance Chart: Bar charts comparing prompt processing and text generation speeds
  - Best Performance Indicators: The fastest model for each metric is highlighted
Understanding Comparison Results¶
| Metric | Description |
|---|---|
| PP Speed | Prompt Processing speed in tokens/second. Higher is better for long inputs. |
| TG Speed | Text Generation speed in tokens/second. Higher means smoother chat experience. |
| GPU Layers | Number of layers offloaded to GPU. "All" indicates full GPU acceleration. |
| Backend | Inference backend used (e.g., Metal, CUDA, CPU). |
The wizard automatically highlights the best-performing model for each metric, making it easy to identify the optimal choice for your use case.
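The same uniform-conditions idea can be scripted directly against llama-bench: run every model with identical parameters and compare the printed results. A sketch with hypothetical model paths:

```python
import subprocess

# Hypothetical model files to compare under identical conditions.
models = [
    "models/llama-3.2-3b-q4_k_m.gguf",
    "models/qwen2.5-7b-q4_k_m.gguf",
    "models/mistral-7b-q8_0.gguf",
]

# Identical parameters for every model, mirroring the wizard's Step 2.
common = ["-p", "1024", "-n", "128", "-r", "3"]

for model in models:
    print(f"=== {model} ===")
    subprocess.run(["llama-bench", "-m", model, *common], check=True)
```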
Accessibility¶
Benchmark charts include accessibility features for all users:
- View Toggle: Switch between chart and table views using the toggle in the top-right corner
- Table View: View benchmark data in a screen reader-compatible table format
- ARIA Labels: All charts have descriptive labels for screen readers
- Patterns: Data series use distinct patterns (stripes, dots, etc.) in addition to colors for color-blind users
- High Contrast Mode: Enhanced visibility when system high contrast mode is enabled