Benchmarking¶
Backend.AI GO includes a powerful benchmarking suite based on llama-bench. This allows you to objectively measure the performance of your hardware and compare different models or quantization levels.
Why Benchmark?¶
- **Hardware Check:** Verify that GPU acceleration (Metal/CUDA) is working correctly.
- **Model Selection:** Decide which quantization level (e.g., Q4 vs. Q8) offers the best balance of speed and quality for your machine.
- **Performance Tracking:** Monitor performance changes after hardware upgrades or driver updates.
Running a Benchmark¶
Navigate to the Models tab, select a model, and click the Benchmark icon (speedometer) to open the benchmarking tool.
1. Quick Test¶
Runs a short, standard test to give you immediate feedback.
- **Settings:** Uses default values (e.g., 512 prompt tokens, 128 generation tokens).
- **Use Case:** A quick sanity check to see whether a model runs at a usable speed.
2. Full Suite (Advanced)¶
Allows you to configure detailed parameters for a comprehensive stress test.
| Parameter | Description | Recommended Value |
|---|---|---|
| Prompt Tokens (PP) | The amount of text fed into the model to simulate reading. | 512, 1024, 4096 |
| Gen Tokens (TG) | The amount of text the model generates. | 128, 256 |
| Batch Size | How many sequences to process in parallel. | 1 (for chat), 512+ (for batch processing) |
| Repetitions | How many times to repeat the test for statistical accuracy. | 5 or more |
| GPU Layers | How many layers to offload to the GPU. | -1 (All) for best performance |
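Because the suite is based on llama-bench, the Full Suite parameters above correspond closely to llama-bench's command-line flags (`-p`, `-n`, `-b`, `-r`, `-ngl`). The sketch below is an illustration of that mapping; the helper name and defaults are assumptions, not Backend.AI GO's actual internals:

```python
# Hypothetical sketch: build the llama-bench invocation that corresponds to the
# Full Suite parameters in the table above. Flag names are from llama.cpp's
# llama-bench; the function itself is an illustration, not part of the app.
def build_benchmark_cmd(model_path, prompt_tokens=(512, 1024, 4096),
                        gen_tokens=(128, 256), batch_size=1,
                        repetitions=5, gpu_layers=99):
    return [
        "llama-bench", "-m", model_path,
        "-p", ",".join(map(str, prompt_tokens)),  # Prompt Tokens (PP)
        "-n", ",".join(map(str, gen_tokens)),     # Gen Tokens (TG)
        "-b", str(batch_size),                    # Batch Size
        "-r", str(repetitions),                   # Repetitions
        "-ngl", str(gpu_layers),                  # GPU Layers; llama-bench takes
    ]                                             # a count (e.g. 99 = offload all)

cmd = build_benchmark_cmd("model.gguf")
```

Note that llama-bench accepts comma-separated values for `-p` and `-n`, so one run can sweep several prompt/generation sizes.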
Understanding Results¶
The benchmark provides several key metrics.
Key Metrics¶
| Metric | Full Name | Meaning | Good Range (Example) |
|---|---|---|---|
| TPS | Tokens Per Second | The overall speed of the model. Higher is better. | > 10 t/s (readable), > 50 t/s (fast) |
| PP Speed | Prompt Processing | How fast the model "reads" your input. Important for RAG or summarizing long documents. | > 100 t/s (M1), > 1000 t/s (RTX 4090) |
| TG Speed | Text Generation | How fast the model "writes" the response. This determines how smooth the chat feels. | > 20 t/s is ideal for chat |
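All three metrics reduce to the same formula: tokens processed divided by elapsed seconds. A minimal sketch (the timing numbers below are made-up examples, not measurements):

```python
# Speed = tokens / elapsed seconds. PP and TG are measured over different
# phases of the same run: reading the prompt vs. generating the response.
def tokens_per_second(n_tokens, elapsed_s):
    return n_tokens / elapsed_s

pp_speed = tokens_per_second(512, 0.4)  # 512 prompt tokens read in 0.4 s -> 1280 t/s
tg_speed = tokens_per_second(128, 4.0)  # 128 tokens generated in 4.0 s   ->   32 t/s
```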
Expected Performance¶
Performance varies wildly based on hardware and model size (parameters).
| Hardware | Model Size | Expected TG Speed |
|---|---|---|
| Apple M5 | 1.7B (Q4) | ~100 t/s |
| Apple M4 | 1.7B (Q4) | ~75 t/s |
| Apple M4 Max | 7B (Q4) | ~110 t/s |
| Apple M5 | 32B (Q4) | ~0.62 t/s |
| NVIDIA RTX 5090 | 70B (Q4) | ~45 t/s |
| NVIDIA RTX 4090 | 70B (Q4) | ~25 t/s |
| NVIDIA RTX 3060 | 7B (Q4) | ~50 t/s |
| CPU Only (Modern) | 7B (Q4) | ~2-5 t/s (very slow) |
* Note: The values above are approximate examples for reference and may vary significantly based on specific hardware configurations, background processes, and thermal conditions.
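A common back-of-the-envelope estimate (an assumption here, not the app's formula): generation is usually memory-bandwidth bound, so expected TG speed is roughly memory bandwidth divided by the model's size on disk, since each generated token reads every weight once:

```python
# Rough upper-bound heuristic for text-generation speed on bandwidth-bound
# hardware. Real results are lower due to overhead, thermals, and KV-cache reads.
def estimated_tg_speed(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

# e.g. ~120 GB/s of memory bandwidth and a ~1.0 GB 1.7B Q4 model
# gives an upper bound of ~120 t/s.
estimate = estimated_tg_speed(120, 1.0)
```

This also explains why a model that exceeds available memory (and starts swapping) collapses to well under 1 t/s.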
Comparing & History¶
Backend.AI GO automatically saves your benchmark runs.
- **History Tab:** View past results to track performance over time.
- **Comparison:** Select multiple runs to see a side-by-side comparison table and charts, which is perfect for visualizing the trade-off between model size and speed.