8.3. Benchmarking¶
Backend.AI GO includes a powerful benchmarking suite based on llama-bench. This allows you to objectively measure the performance of your hardware and compare different models or quantization levels.
Why Benchmark?¶
- Hardware Check: Verify that GPU acceleration (Metal/CUDA) is working correctly.
- Model Selection: Decide which quantization level (e.g., Q4 vs. Q8) offers the best balance of speed and quality on your machine.
- Performance Tracking: Monitor performance changes after hardware upgrades or driver updates.
Running a Benchmark¶
Navigate to the Models tab, select a model, and click the Benchmark icon (speedometer) to open the benchmarking tool.
1. Quick Test¶

Runs a short, standard test to give you immediate feedback.
- Settings: Uses default values (e.g., 512 prompt tokens, 128 generation tokens).
- Use Case: A quick sanity check to see whether a model runs at a usable speed.
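Because the suite is built on llama-bench, the Quick Test corresponds roughly to the invocation sketched below. This is an illustration only, assuming a local llama-bench binary from llama.cpp on your PATH and a hypothetical model path; Backend.AI GO runs the equivalent for you.

```python
import subprocess

# Rough CLI equivalent of the Quick Test (assumes llama-bench is on PATH;
# the model path below is a hypothetical example).
result = subprocess.run(
    [
        "llama-bench",
        "-m", "models/llama-3.2-1b-q4_k_m.gguf",  # hypothetical model file
        "-p", "512",   # prompt tokens (default Quick Test value)
        "-n", "128",   # generation tokens (default Quick Test value)
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # llama-bench prints a Markdown results table by default
```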
2. Full Suite (Advanced)¶

Allows you to configure detailed parameters for a comprehensive stress test.
| Parameter | Description | Recommended Value |
|---|---|---|
| Prompt Tokens (PP) | The amount of text fed into the model to simulate reading. | 512, 1024, 4096 |
| Gen Tokens (TG) | The amount of text the model generates. | 128, 256 |
| Batch Size | How many sequences to process in parallel. | 1 (for chat), 512+ (for batch processing) |
| Repetitions | How many times to repeat the test for statistical accuracy. | 5 or more |
| GPU Layers | How many layers to offload to the GPU. | -1 (All) for best performance |
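For reference, a comparable sweep can be expressed directly against llama-bench, which accepts comma-separated value lists and runs every combination. A minimal sketch, assuming a local llama-bench binary and a hypothetical model path; note that `-ngl 99` is the common way to offload all layers, matching the app's "-1 (All)" setting.

```python
import subprocess

# Full Suite sketch: llama-bench runs every combination of the listed values.
cmd = [
    "llama-bench",
    "-m", "models/qwen2.5-7b-q4_k_m.gguf",  # hypothetical model file
    "-p", "512,1024,4096",  # Prompt Tokens (PP)
    "-n", "128,256",        # Gen Tokens (TG)
    "-b", "512",            # Batch Size
    "-r", "5",              # Repetitions per configuration
    "-ngl", "99",           # offload all layers (the app's "-1 (All)")
]
subprocess.run(cmd, check=True)
```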
Understanding Results¶
The benchmark provides several key metrics.
Key Metrics¶
| Metric | Full Name | Meaning | Good Range (Example) |
|---|---|---|---|
| TPS | Tokens Per Second | The overall speed of the model. Higher is better. | > 10 t/s (readable); > 50 t/s (fast) |
| PP Speed | Prompt Processing | How fast the model "reads" your input. Important for RAG or summarizing long documents. | > 100 t/s (Apple M1); > 1000 t/s (RTX 4090) |
| TG Speed | Text Generation | How fast the model "writes" the response. This determines how smooth the chat feels. | > 20 t/s is ideal for chat. |
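All of these speeds are simply token counts divided by wall-clock time. A minimal sketch of the arithmetic, using the thresholds from the table above (the sample numbers are illustrative):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second."""
    return n_tokens / elapsed_s

# Example: 128 generated tokens in 4.1 seconds (illustrative numbers).
tg = tokens_per_second(128, 4.1)   # ~31 t/s
if tg >= 20:
    print(f"{tg:.1f} t/s: smooth enough for interactive chat")
elif tg >= 10:
    print(f"{tg:.1f} t/s: readable, but noticeably slow")
else:
    print(f"{tg:.1f} t/s: below comfortable reading speed")
```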
Expected Performance¶
Performance varies widely with hardware and model size (parameter count).
| Hardware | Model Size | Expected TG Speed |
|---|---|---|
| Apple M5 | 1.7B (Q4) | ~100 t/s |
| Apple M4 | 1.7B (Q4) | ~75 t/s |
| Apple M4 Max | 7B (Q4) | ~110 t/s |
| Apple M5 | 32B (Q4) | ~0.62 t/s |
| NVIDIA RTX 5090 | 70B (Q4) | ~45 t/s |
| NVIDIA RTX 4090 | 70B (Q4) | ~25 t/s |
| NVIDIA RTX 3060 | 7B (Q4) | ~50 t/s |
| CPU Only (Modern) | 7B (Q4) | ~2-5 t/s (very slow) |
* Note: The values above are approximate examples for reference and may vary significantly based on specific hardware configurations, background processes, and thermal conditions.
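For a rough sanity check of your own numbers: single-stream text generation is usually memory-bandwidth bound, so an upper bound on TG speed is approximately memory bandwidth divided by the model's size in memory. A back-of-the-envelope sketch; the bandwidth figure is an approximate published spec, not a measurement:

```python
def estimated_tg_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on text-generation speed (t/s) for a dense model:
    each generated token reads roughly the whole model from memory."""
    return bandwidth_gb_s / model_size_gb

# Approximate figures: a 7B Q4 model occupies ~4.5 GB; an M4 Max has
# ~546 GB/s of memory bandwidth (published spec).
print(f"~{estimated_tg_ceiling(546, 4.5):.0f} t/s ceiling")  # ~121 t/s,
# in line with the ~110 t/s shown in the table above.
```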
Comparing & History¶
Backend.AI GO automatically saves your benchmark runs.
- History Tab: View past results to track performance over time.
- Comparison: Select multiple runs to see a side-by-side comparison table and charts. This is perfect for visualizing the trade-off between model size and speed.
3. Hardware Profile¶

Finds optimal hardware settings by testing multiple configurations. You can vary:
- Thread counts: Test different CPU thread counts (1, 2, 4, 8, 16) to find the optimal value.
- GPU layer %: Test different percentages of model layers offloaded to the GPU (0%, 25%, 50%, 75%, 100%).
- Flash Attention: Toggle Flash Attention on and off to compare its impact on performance.
The estimated duration is shown before starting, so you can plan accordingly.
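If you want to reproduce a similar sweep outside the app, llama-bench can vary threads, GPU layers, and Flash Attention in one run. A sketch assuming a hypothetical model with 32 layers; the percentage-to-layer conversion mirrors what the profile does conceptually, and you should confirm the exact flags with `llama-bench --help` on your build:

```python
import subprocess

N_LAYERS = 32  # assumed layer count of the test model (model-dependent)

# Convert the profile's GPU-layer percentages into absolute layer counts.
pcts = [0, 25, 50, 75, 100]
ngl_values = sorted({round(p / 100 * N_LAYERS) for p in pcts})

cmd = [
    "llama-bench",
    "-m", "models/qwen2.5-7b-q4_k_m.gguf",         # hypothetical model file
    "-t", "1,2,4,8,16",                            # thread counts to test
    "-ngl", ",".join(str(n) for n in ngl_values),  # 0,8,16,24,32
    "-fa", "0,1",                                  # Flash Attention off/on
]
subprocess.run(cmd, check=True)
```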
Model Comparison Wizard¶
The Model Comparison Wizard provides a guided, step-by-step workflow for comparing multiple models under identical test conditions.
How to Use¶
- Open the Wizard: Navigate to the Benchmark section and click the Compare Models button to launch the wizard.
- Step 1 - Select Models: Choose 2 to 4 models you want to compare. A specification preview table shows the selected models' key details (file size, quantization, context length).
- Step 2 - Configure Parameters: Set the test parameters that will be applied uniformly to all selected models:
  - Context size: Number of prompt tokens (256 to 4096)
  - Generation length: Number of tokens to generate (32 to 256)
  - Repetitions: Number of test runs for statistical accuracy (1 to 5)
- Step 3 - View Results: After running the comparison, view side-by-side results including:
  - Specification Table: File size, quantization, format, and performance metrics (PP/TG speed)
  - Performance Chart: Bar charts comparing prompt processing and text generation speeds
  - Best Performance Indicators: The fastest model for each metric is highlighted
Understanding Comparison Results¶
| Metric | Description |
|---|---|
| PP Speed | Prompt Processing speed in tokens/second. Higher is better for long inputs. |
| TG Speed | Text Generation speed in tokens/second. Higher means smoother chat experience. |
| GPU Layers | Number of layers offloaded to GPU. "All" indicates full GPU acceleration. |
| Backend | Inference backend used (e.g., Metal, CUDA, CPU). |
The wizard automatically highlights the best-performing model for each metric, making it easy to identify the optimal choice for your use case.
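The same uniform-conditions idea can be scripted directly against llama-bench: run every model with identical parameters and compare the printed results. A sketch with hypothetical model paths:

```python
import subprocess

# Hypothetical model files to compare under identical conditions.
models = [
    "models/llama-3.2-3b-q4_k_m.gguf",
    "models/qwen2.5-7b-q4_k_m.gguf",
    "models/mistral-7b-q8_0.gguf",
]

# Identical parameters for every model, mirroring the wizard's Step 2.
common = ["-p", "1024", "-n", "128", "-r", "3"]

for model in models:
    print(f"=== {model} ===")
    subprocess.run(["llama-bench", "-m", model, *common], check=True)
```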
Accessibility¶
Benchmark charts include accessibility features for all users:
- View Toggle: Switch between chart and table views using the toggle in the top-right corner
- Table View: View benchmark data in a screen reader-compatible table format
- ARIA Labels: All charts have descriptive labels for screen readers
- Patterns: Data series use distinct patterns (stripes, dots, etc.) in addition to colors for color-blind users
- High Contrast Mode: Enhanced visibility when system high contrast mode is enabled