Benchmarking

Backend.AI GO includes a powerful benchmarking suite based on llama-bench. This allows you to objectively measure the performance of your hardware and compare different models or quantization levels.

Why Benchmark?

  • Hardware Check: Verify that GPU acceleration (Metal/CUDA) is working correctly.

  • Model Selection: Decide which quantization level (e.g., Q4 vs Q8) offers the best balance of speed and quality for your machine.

  • Performance Tracking: Monitor performance changes after hardware upgrades or driver updates.

Running a Benchmark

Navigate to the Models tab, select a model, and click the Benchmark icon (speedometer) to open the benchmarking tool.

1. Quick Test

Runs a short, standard test to give you immediate feedback.

  • Settings: Uses default values (e.g., 512 prompt tokens, 128 generation tokens); see the command-line sketch after this list.

  • Use Case: Quick sanity check to see if a model runs at a usable speed.
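Because the suite is built on llama-bench, you can reproduce a comparable quick test from the command line. A minimal sketch, assuming a local llama.cpp installation; the model path is a placeholder:

```bash
# Quick sanity check: 512 prompt tokens, 128 generated tokens (the defaults above)
llama-bench -m ./models/my-model.gguf -p 512 -n 128
```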

2. Full Suite (Advanced)

Allows you to configure detailed parameters for a comprehensive stress test; an equivalent command-line invocation is sketched after the table.

| Parameter | Description | Recommended Value |
| --- | --- | --- |
| Prompt Tokens (PP) | The amount of text fed into the model to simulate reading. | 512, 1024, 4096 |
| Gen Tokens (TG) | The amount of text the model generates. | 128, 256 |
| Batch Size | How many sequences to process in parallel. | 1 (for chat), 512+ (for batch processing) |
| Repetitions | How many times to repeat the test for statistical accuracy. | 5 or more |
| GPU Layers | How many layers to offload to the GPU. | -1 (All) for best performance |
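For reference, the Full Suite parameters map onto llama-bench flags roughly as follows. This is a sketch rather than the exact command the app runs; the model path is a placeholder, and on the CLI a large value such as 99 is the common way to offload all layers, matching the GUI's -1 (All):

```bash
# -p: Prompt Tokens (PP)   -n: Gen Tokens (TG)   -b: Batch Size
# -r: Repetitions          -ngl: GPU Layers (99 = effectively all)
# Comma-separated values run every combination.
llama-bench -m ./models/my-model.gguf \
  -p 512,1024,4096 -n 128,256 -b 512 -r 5 -ngl 99
```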

Understanding Results

The benchmark reports several key metrics; a worked example follows the table below.

Key Metrics

| Metric | Full Name | Meaning | Good Range (Example) |
| --- | --- | --- | --- |
| TPS | Tokens Per Second | The overall speed of the model. Higher is better. | > 10 t/s (readable), > 50 t/s (fast) |
| PP Speed | Prompt Processing | How fast the model "reads" your input. Important for RAG or summarizing long documents. | > 100 t/s (M1), > 1,000 t/s (RTX 4090) |
| TG Speed | Text Generation | How fast the model "writes" the response. This determines how smooth the chat feels. | > 20 t/s (ideal for chat) |
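As a quick worked example of how these numbers translate into the user experience: a model that generates 256 tokens in 8 seconds has a TG speed of 256 / 8 = 32 t/s, comfortably above the ~20 t/s chat threshold. The same model reading a 4,096-token document at a PP speed of 512 t/s needs about 8 seconds before the first output token appears.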

Expected Performance

Performance varies widely with hardware and model size (parameter count); a rule of thumb for estimating these figures follows the table.

| Hardware | Model Size | Expected TG Speed |
| --- | --- | --- |
| Apple M4 | 1.7B (Q4) | ~75 t/s |
| Apple M5 | 1.7B (Q4) | ~100 t/s |
| Apple M4 Max | 7B (Q4) | ~110 t/s |
| Apple M5 | 32B (Q4) | ~0.62 t/s |
| NVIDIA RTX 3060 | 7B (Q4) | ~50 t/s |
| NVIDIA RTX 4090 | 70B (Q4) | ~25 t/s |
| NVIDIA RTX 5090 | 70B (Q4) | ~45 t/s |
| CPU Only (Modern) | 7B (Q4) | ~2-5 t/s (Very slow) |

Note: The values above are approximate examples for reference and may vary significantly based on specific hardware configurations, background processes, and thermal conditions.
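As a rough rule of thumb, token generation is usually memory-bandwidth-bound, so you can estimate the ceiling as memory bandwidth (GB/s) ÷ model file size (GB). For example, hardware with ~400 GB/s of bandwidth running a 7B Q4 model (~4 GB file) tops out around 100 t/s, and real-world results land somewhat below that. The same logic explains the dramatic slowdown when a model does not fit in memory: once the system swaps to disk, speeds can fall below 1 t/s.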

Comparing & History

Backend.AI GO automatically saves your benchmark runs.

  • History Tab: View past results to track performance over time.

  • Comparison: Select multiple runs to see a side-by-side comparison table and charts. This is perfect for visualizing the trade-off between model size and speed.
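If you also benchmark from the command line, llama-bench can emit machine-readable results for your own comparisons. A small sketch; the model paths and file names are placeholders:

```bash
# Save two runs as JSON (csv and md are also supported via -o)
llama-bench -m ./models/model-q4.gguf -o json > q4-results.json
llama-bench -m ./models/model-q8.gguf -o json > q8-results.json
```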