8.3. Benchmarking

Backend.AI GO includes a powerful benchmarking suite based on llama-bench. This allows you to objectively measure the performance of your hardware and compare different models or quantization levels.

Why Benchmark?

  • Hardware Check: Verify if your GPU acceleration (Metal/CUDA) is working correctly.

  • Model Selection: Decide which quantization level (e.g., Q4 vs Q8) offers the best balance of speed and quality for your machine.

  • Performance Tracking: Monitor performance changes after hardware upgrades or driver updates.

Running a Benchmark

Navigate to the Models tab, select a model, and click the Benchmark icon (speedometer) to open the benchmarking tool.

1. Quick Test

Quick benchmark test

Runs a short, standard test to give you immediate feedback.

  • Settings: Uses default values (e.g., 512 prompt tokens, 128 generation tokens).

  • Use Case: Quick sanity check to see if a model runs at a usable speed.
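
Because the suite is based on llama-bench, the Quick Test roughly corresponds to a direct llama-bench invocation. The sketch below builds such a command in Python; the model path is a placeholder, and `-o json` asks llama-bench for machine-readable output:

```python
import json
import subprocess

# Quick-test defaults expressed as a llama-bench invocation.
# "model.gguf" is a placeholder; point it at a local GGUF file.
cmd = [
    "llama-bench",
    "-m", "model.gguf",
    "-p", "512",    # prompt tokens (quick-test default)
    "-n", "128",    # generation tokens (quick-test default)
    "-o", "json",   # machine-readable output format
]

# Uncomment to run against a real model:
# out = subprocess.run(cmd, capture_output=True, text=True, check=True)
# results = json.loads(out.stdout)
```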

2. Full Suite (Advanced)

Full benchmark suite

Allows you to configure detailed parameters for a comprehensive stress test.

| Parameter | Description | Recommended Value |
| --- | --- | --- |
| Prompt Tokens (PP) | The amount of text fed into the model to simulate reading. | 512, 1024, 4096 |
| Gen Tokens (TG) | The amount of text the model generates. | 128, 256 |
| Batch Size | How many sequences to process in parallel. | 1 (chat), 512+ (batch processing) |
| Repetitions | How many times to repeat the test for statistical accuracy. | 5 or more |
| GPU Layers | How many layers to offload to the GPU. | -1 (All) for best performance |
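
These parameters map onto llama-bench flags, which accept comma-separated lists and run every combination. A sketch of a full sweep (model path is a placeholder; `-ngl 99` is a common way to offload effectively all layers, corresponding to the GUI's "-1 (All)"):

```python
import subprocess

# Full-suite parameters mapped onto llama-bench flags.
# llama-bench runs every combination of the listed values.
cmd = [
    "llama-bench",
    "-m", "model.gguf",      # placeholder model path
    "-p", "512,1024,4096",   # Prompt Tokens (PP)
    "-n", "128,256",         # Gen Tokens (TG)
    "-b", "1,512",           # Batch Size
    "-r", "5",               # Repetitions per combination
    "-ngl", "99",            # offload (effectively) all layers to the GPU
]

# subprocess.run(cmd, check=True)  # uncomment to run for real
```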

Understanding Results

The benchmark provides several key metrics.

Key Metrics

| Metric | Full Name | Meaning | Good Range (Example) |
| --- | --- | --- | --- |
| TPS | Tokens Per Second | The overall speed of the model; higher is better. | > 10 t/s (readable), > 50 t/s (fast) |
| PP Speed | Prompt Processing | How fast the model "reads" your input; important for RAG or summarizing long documents. | > 100 t/s (M1), > 1000 t/s (RTX 4090) |
| TG Speed | Text Generation | How fast the model "writes" the response; determines how smooth the chat feels. | > 20 t/s is ideal for chat |
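
Both speeds are simple token-count-over-time ratios. A minimal sketch of how they are derived from raw timings (the timing values are illustrative):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second for one benchmark phase."""
    return n_tokens / elapsed_s

# Example: 512 prompt tokens processed in 4.0 s,
# then 128 tokens generated in 5.0 s.
pp_speed = tokens_per_second(512, 4.0)  # 128.0 t/s
tg_speed = tokens_per_second(128, 5.0)  # 25.6 t/s -> comfortable for chat
```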

Expected Performance

Performance varies widely depending on hardware and model size (parameter count).

| Hardware | Model Size | Expected TG Speed |
| --- | --- | --- |
| Apple M5 | 1.7B (Q4) | ~100 t/s |
| Apple M4 | 1.7B (Q4) | ~75 t/s |
| Apple M4 Max | 7B (Q4) | ~110 t/s |
| Apple M5 | 32B (Q4) | ~0.62 t/s |
| NVIDIA RTX 5090 | 70B (Q4) | ~45 t/s |
| NVIDIA RTX 4090 | 70B (Q4) | ~25 t/s |
| NVIDIA RTX 3060 | 7B (Q4) | ~50 t/s |
| CPU Only (Modern) | 7B (Q4) | ~2-5 t/s (very slow) |

* Note: The values above are approximate examples for reference and may vary significantly based on specific hardware configurations, background processes, and thermal conditions.
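
As a rough sanity check on figures like these, text generation on a memory-bound device is often approximated by dividing memory bandwidth by model size, since each generated token reads the whole model once. The bandwidth and size numbers below are illustrative assumptions, not measurements:

```python
def estimate_tg_speed(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper-bound TG estimate: bandwidth divided by bytes read per token."""
    return bandwidth_gb_s / model_gb

# Illustrative figures (assumed, not measured):
# ~546 GB/s memory bandwidth, 7B model at Q4 ~ 4.2 GB.
upper_bound = estimate_tg_speed(546.0, 4.2)  # ~130 t/s theoretical ceiling

# Measured TG speed lands below this bound due to compute
# overhead, background processes, and thermal throttling.
```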

Comparing & History

Backend.AI GO automatically saves your benchmark runs.

  • History Tab: View past results to track performance over time.

  • Comparison: Select multiple runs to see a side-by-side comparison table and charts. This is perfect for visualizing the trade-off between model size and speed.

3. Hardware Profile

Hardware profile benchmark

Finds optimal hardware settings for a model by sweeping multiple configurations. You can vary:

  • Thread counts: Test with different CPU thread counts (1, 2, 4, 8, 16) to find the optimal value.

  • GPU layer %: Test different percentages of model layers offloaded to GPU (0%, 25%, 50%, 75%, 100%).

  • Flash Attention: Toggle Flash Attention on/off to compare its impact on performance.

The estimated duration is shown before starting, so you can plan accordingly.
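
The duration estimate follows directly from the size of the configuration grid. A sketch of the enumeration, assuming an illustrative ~30 seconds per configuration:

```python
from itertools import product

threads = [1, 2, 4, 8, 16]
gpu_layer_pct = [0, 25, 50, 75, 100]
flash_attention = [False, True]

# Every combination the Hardware Profile run would test.
configs = list(product(threads, gpu_layer_pct, flash_attention))

# Duration estimate at an assumed ~30 s per configuration.
est_minutes = len(configs) * 30 / 60
print(len(configs), "configurations,", est_minutes, "minutes")
```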

Model Comparison Wizard

The Model Comparison Wizard provides a guided, step-by-step workflow for comparing multiple models under identical test conditions.

How to Use

  1. Open the Wizard: Navigate to the Benchmark section and click the Compare Models button to launch the wizard.

  2. Step 1 - Select Models: Choose 2 to 4 models you want to compare. A specification preview table shows the selected models' key details (file size, quantization, context length).

  3. Step 2 - Configure Parameters: Set the test parameters that will be applied uniformly to all selected models:

    • Context size: Number of prompt tokens (256 to 4096)
    • Generation length: Number of tokens to generate (32 to 256)
    • Repetitions: Number of test runs for statistical accuracy (1 to 5)
  4. Step 3 - View Results: After running the comparison, view side-by-side results including:

    • Specification Table: File size, quantization, format, and performance metrics (PP/TG speed)
    • Performance Chart: Bar charts comparing prompt processing and text generation speeds
    • Best Performance Indicators: The fastest model for each metric is highlighted
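
The "best performance" highlighting amounts to a per-metric argmax over the selected models. A sketch with made-up result data (model names and speeds are hypothetical):

```python
# Hypothetical results from one comparison run (values are made up).
results = {
    "model-a-q4": {"pp_speed": 850.0, "tg_speed": 42.0},
    "model-b-q8": {"pp_speed": 910.0, "tg_speed": 31.0},
}

def best_per_metric(results: dict) -> dict:
    """Return {metric: model_name} with the fastest model for each metric."""
    metrics = next(iter(results.values())).keys()
    return {m: max(results, key=lambda name: results[name][m]) for m in metrics}

print(best_per_metric(results))
# The smaller Q4 model wins on generation speed; the Q8 model on prompt processing.
```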

Understanding Comparison Results

| Metric | Description |
| --- | --- |
| PP Speed | Prompt Processing speed in tokens/second. Higher is better for long inputs. |
| TG Speed | Text Generation speed in tokens/second. Higher means a smoother chat experience. |
| GPU Layers | Number of layers offloaded to the GPU. "All" indicates full GPU acceleration. |
| Backend | Inference backend used (e.g., Metal, CUDA, CPU). |

The wizard automatically highlights the best-performing model for each metric, making it easy to identify the optimal choice for your use case.

Accessibility

Benchmark charts include accessibility features for all users:

  • View Toggle: Switch between chart and table views using the toggle in the top-right corner
  • Table View: View benchmark data in a screen reader-compatible table format
  • ARIA Labels: All charts have descriptive labels for screen readers
  • Patterns: Data series use distinct patterns (stripes, dots, etc.) in addition to colors for color-blind users
  • High Contrast Mode: Enhanced visibility when system high contrast mode is enabled