7.1. --parallel slot benchmark and ceiling rationale¶
Issue: lablup/backend.ai-go#3025
TL;DR¶
Backend.AI GO clamps the user-facing --parallel (concurrent request slots) value forwarded to both llama-server and mlxcel-server to a fixed range:
| Bound | Old (pre-3025) | New (this change) | Defined in |
|---|---|---|---|
| Minimum | 1 | 1 (unchanged) | PARALLEL_SLOTS_MIN in src-tauri/src/process/parallel_slots.rs |
| Maximum | 4 (undocumented) | 8 | PARALLEL_SLOTS_MAX in src-tauri/src/process/parallel_slots.rs |
| Per-slot context floor (warning) | (none) | 1024 tokens | PARALLEL_SLOTS_PER_SLOT_FLOOR in src-tauri/src/process/parallel_slots.rs |
The ceiling is a single constant, not model-aware — see Decision: constant vs. model-aware.
A new per-slot context safety warning surfaces in the UI before the user saves a configuration that would push context_size ÷ parallel below 1024 tokens — see Per-slot context floor.
Background¶
Backend.AI GO's --parallel value is the number of concurrent request slots the loaded inference server can serve from a single model. It is forwarded to both engines:
- llama-server (llama.cpp's HTTP server): the slot is a server_slot in the slot manager (see tools/server/server.cpp in the upstream tree); each slot owns its share of the prompt KV cache. The flag is -np / --parallel, with a documented default of 1 and no documented upper bound. Public serving deployments routinely run 8–32 slots on server-class hardware.
- mlxcel-server (Backend.AI's MLX-based inference engine on Apple Silicon): the flag is accepted by the binary; behavior is verified empirically (see mlxcel verification).
Prior to issue #3025 the clamp was the literal parallel.clamp(1, 4) at src-tauri/src/process/types.rs:650-659, with no inline comment and no design doc explaining the ceiling. The matching UI slider in src/types/modelConfig.ts:363-373 and src/components/ModelConfigDrawer/ContextTab.tsx:227-245 used the same range. With Backend.AI GO's agentic surface — sub-agent @mentions, the chatActiveRunsSlice tracking multiple concurrent runs, chain-impl / epic-impl workflows, the coworkStore collaborative flow — workloads that produce >4 concurrent in-flight requests against a single loaded model became routine. Any request beyond slot 4 queued silently inside the slot manager and added latency the agent layer could not observe.
Investigation — four questions¶
The proposal on issue #3025 asked four explicit questions. The answers below are the basis for the chosen ceiling.
1. What is llama.cpp's actual practical ceiling for --parallel?¶
llama.cpp has no hard upper bound on -np / --parallel in the source. The server.cpp slot manager allocates n_parallel server_slot entries in a std::vector; the practical bound is memory.
KV memory scales linearly with total context, so at a fixed --ctx-size each slot's share shrinks in proportion to the slot count:
prompt_kv_cache_total ≈ n_layers · n_embd · 2 · sizeof(scalar) · n_ctx
per_slot_kv_cache ≈ prompt_kv_cache_total / n_parallel
--ctx-size is a total budget — increasing --parallel either:
- Shrinks per-slot context for a fixed --ctx-size (per-slot share = n_ctx / n_parallel), or
- Requires the user to scale --ctx-size up to preserve per-slot context, which in turn re-grows total KV memory linearly.
This is the dominant safety consideration for the ceiling. Production serving deployments that use n_parallel = 32 typically also use --ctx-size = 32 · per_user_ctx, which is feasible on a server with hundreds of GB of RAM but not on a 16 GB consumer laptop.
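The arithmetic behind this trade-off can be sketched as follows. This is a minimal illustration of the approximation above, not code from the repository; the function names and the example dimensions (32 layers, n_embd = 4096, 2-byte scalars) are assumptions chosen only to make the numbers round.

```rust
/// Approximate prompt KV cache size in bytes, following the formula above:
/// n_layers * n_embd * 2 (keys and values) * sizeof(scalar) * n_ctx.
fn kv_cache_total_bytes(n_layers: u64, n_embd: u64, scalar_bytes: u64, n_ctx: u64) -> u64 {
    n_layers * n_embd * 2 * scalar_bytes * n_ctx
}

/// Per-slot context share for a fixed total --ctx-size (integer division).
/// `n_parallel` must be at least 1.
fn per_slot_ctx(n_ctx: u64, n_parallel: u64) -> u64 {
    n_ctx / n_parallel
}

/// Total --ctx-size needed so that every slot keeps `per_slot` tokens.
fn ctx_for_per_slot(per_slot: u64, n_parallel: u64) -> u64 {
    per_slot * n_parallel
}
```

With these illustrative dimensions, `kv_cache_total_bytes(32, 4096, 2, 8192)` is 4 GiB and each of 8 slots gets `per_slot_ctx(8192, 8) = 1024` tokens. Keeping 8192 tokens per slot at 8 slots needs `ctx_for_per_slot(8192, 8) = 65536` total tokens, which the same formula puts at 32 GiB of KV cache.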
References:
- llama.cpp common/arg.cpp: -np N, --parallel N argument definition ("number of parallel sequences to decode (default: 1)")
- llama.cpp tools/server/server.cpp: server_slot struct and slot-manager loop
- llama.cpp issues tracker: multiple production deployment reports running n_parallel ≥ 8
2. What is mlxcel-server's actual behavior with --parallel?¶
Backend.AI GO previously asserted in an in-code comment that "both llama-server and mlxcel support this" (src-tauri/src/process/types.rs:996). The investigation treats that as a hypothesis to verify, not a guarantee.
The verification procedure (see mlxcel verification for the recipe) was performed against the bundled mlxcel-server binary. The current finding is summarized below. Because the binary is shipped from a separate upstream and Backend.AI GO does not own its source, ongoing verification is expected at every mlxcel-server version bump (the procedure is checked into this doc precisely so the next maintainer can re-run it).
3. What ceiling makes sense for Backend.AI GO specifically?¶
Target hardware: consumer laptops — Apple Silicon (16–32 GB unified memory typical), mid-range Windows / Linux desktops (16–48 GB RAM, optional consumer GPU). Not server-class machines.
Three candidates were considered:
| Option | Pros | Cons | Decision |
|---|---|---|---|
| Keep 4, document | Zero behavioral change; surfaces existing assumption | Bottlenecks the very agent workflows Backend.AI GO emphasizes (chain-impl, epic-impl, cowork) | Rejected: does not solve the original problem |
| Raise to 8 | Doubles agent-friendly headroom; matches the upper edge of what 16 GB / 24 GB consumer hardware tolerates with a typical 4–8 K context | Users on the smallest configurations (8 GB RAM, very small context) can still misconfigure | Accepted with per-slot floor warning |
| Raise to 16 or remove cap | Maximum flexibility for power users | Materially raises the surface area for "default UI slider produces an OOM on save"; per-slot context at 16384 / 16 = 1024 is already at the warning floor and gets worse from there on smaller configurations | Rejected: yields a slider whose right half is foot-gun territory on the target hardware |
Chosen ceiling: 8. This doubles agent-friendly headroom while keeping the slider's right edge inside the safe envelope for the dominant deployment shape (a single loaded model on a 16–24 GB Apple Silicon / mid-range x86_64 laptop, 4–32 K context).
4. Should the clamp be model-aware?¶
Considered explicitly. Rejected for this iteration. See Decision: constant vs. model-aware for the reasoning.
Decision: constant vs. model-aware¶
A model-aware ceiling would compute the maximum n_parallel as a function of (model size, total context, available memory). This is appealing in theory — a 3B model at 4K context can safely run 8+ slots on a 16 GB laptop, whereas a 70B model at 128K context cannot run even 2 slots on the same laptop. In practice we chose against it for this iteration:
- Inputs are awkward to compute reliably across the three target platforms. Free RAM at decision time is OS-specific (Linux MemAvailable, macOS vm_stat pages, Windows GlobalMemoryStatusEx), and the relevant quantity is unloaded free RAM after the user finishes their other workloads. KV growth per slot depends on quantization, model architecture (MoE active-expert footprint differs from dense models), and llama-server's internal cache-type setting. Getting any of these wrong gives the user a model-aware ceiling that contradicts what actually loads.
- The per-slot context floor warning already captures the dominant safety case. "You picked a slot count that shrinks per-slot context below useful" is the failure mode that hurts agent workflows the most, not pure OOM. The warning is implemented and active; the model-aware ceiling would be redundant for that case.
- A model-aware ceiling is a backward-compatible additive change. evaluate_parallel_slots is the single entry point for clamp + warning evaluation. A future iteration can take model_info: ModelInfo and available_memory_bytes: u64 and tighten the effective ceiling without changing either transport surface (Tauri command, REST endpoint) or the slider bounds.
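To make the "additive change" concrete, here is one hypothetical shape such a future extension could take. Nothing here exists in the codebase: the `kv_bytes_per_token` field, the `model_aware_ceiling` function, and the assumption that `available_memory_bytes` already excludes the model weights are all illustrative.

```rust
const PARALLEL_SLOTS_MIN: u32 = 1;
const PARALLEL_SLOTS_MAX: u32 = 8;

/// Hypothetical model description; the single field is an assumption for this sketch.
struct ModelInfo {
    /// Estimated KV cache bytes per token of context.
    kv_bytes_per_token: u64,
}

/// Hypothetical model-aware ceiling: how many slots of `per_slot_ctx` tokens fit
/// in `available_memory_bytes`, never exceeding the static ceiling and never
/// dropping below the static minimum.
fn model_aware_ceiling(
    model_info: &ModelInfo,
    available_memory_bytes: u64,
    per_slot_ctx: u32,
) -> u32 {
    let per_slot_bytes = model_info
        .kv_bytes_per_token
        .saturating_mul(u64::from(per_slot_ctx));
    if per_slot_bytes == 0 {
        // No usable estimate: fall back to the static ceiling.
        return PARALLEL_SLOTS_MAX;
    }
    // Cap before narrowing so the cast to u32 cannot truncate.
    let fit = (available_memory_bytes / per_slot_bytes).min(u64::from(PARALLEL_SLOTS_MAX)) as u32;
    fit.max(PARALLEL_SLOTS_MIN)
}
```

Because the result can only tighten the range [PARALLEL_SLOTS_MIN, PARALLEL_SLOTS_MAX], a caller that ignores it keeps today's behavior, which is what keeps the change additive.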
Per-slot context floor (1024 tokens)¶
The per-slot context floor is the value below which context_size ÷ parallel is considered too tight to host a realistic agent prompt (system prompt + tool listing + a few turns of recent context). When the user-selected (parallel, context_size) pair would put each slot below the floor, evaluate_parallel_slots returns a BelowPerSlotFloor warning that the UI surfaces inline below the slider.
1024 was picked as a heuristic, not a hard physical limit:
- A typical Backend.AI GO system prompt + tool listing for a Cowork agent is ~600–900 tokens. Below 1 K the slot has no room for any user turn.
- llama-server itself does not reject n_ctx / n_parallel < 1024; execution continues but slot eviction becomes constant.
The clamp itself still applies even when the floor warning fires: the user is not blocked. The warning is the signal. This matches the rest of the model-config UI ("we apply your value and tell you why we think it is questionable") and keeps the door open for power users who legitimately want a very tight per-slot footprint (e.g., embedding-style batched inference).
Evaluator API (Rust source of truth)¶
The clamp and floor logic live in src-tauri/src/process/parallel_slots.rs. Both transports — the Tauri command evaluate_parallel_slots and the REST endpoint GET /api/v1/model-config/evaluate-parallel-slots — call into the same Rust function. See .claude/rules/api-parity.md "Single Source of Truth" for the rationale.
```rust
use crate::process::parallel_slots::evaluate_parallel_slots;

let decision = evaluate_parallel_slots(
    requested,          // u32: the user-selected slot count
    Some(context_size), // Option<u32>: total --ctx-size, or None when unknown
);
// decision.requested: the original value
// decision.effective: value after clamping into [PARALLEL_SLOTS_MIN, PARALLEL_SLOTS_MAX]
// decision.warning: Option<ParallelSlotWarning>, one of None / Clamped / BelowPerSlotFloor
```
The TypeScript side has a local mirror at evaluateParallelSlotsLocal in src/types/modelConfig.ts that keeps the slider responsive while it is being dragged; the Rust evaluator is the authoritative answer on apply / save.
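For readers without the source at hand, the behavior described in this document can be approximated by the standalone sketch below. It is not the contents of parallel_slots.rs. The type and constant names follow this document, but the enum layout (variants without payloads) and the choice to let the floor warning take precedence over the clamp notice are assumptions.

```rust
const PARALLEL_SLOTS_MIN: u32 = 1;
const PARALLEL_SLOTS_MAX: u32 = 8;
const PARALLEL_SLOTS_PER_SLOT_FLOOR: u32 = 1024;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParallelSlotWarning {
    /// The requested value was outside [MIN, MAX] and was clamped.
    Clamped,
    /// context_size / effective falls below the per-slot floor.
    BelowPerSlotFloor,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ParallelSlotDecision {
    requested: u32,
    effective: u32,
    warning: Option<ParallelSlotWarning>,
}

fn evaluate_parallel_slots(requested: u32, context_size: Option<u32>) -> ParallelSlotDecision {
    // `effective` is always >= 1, so the division below cannot divide by zero.
    let effective = requested.clamp(PARALLEL_SLOTS_MIN, PARALLEL_SLOTS_MAX);
    let below_floor = context_size
        .map(|ctx| ctx / effective < PARALLEL_SLOTS_PER_SLOT_FLOOR)
        .unwrap_or(false);
    // Assumption: when both conditions hold, the floor warning wins.
    let warning = if below_floor {
        Some(ParallelSlotWarning::BelowPerSlotFloor)
    } else if effective != requested {
        Some(ParallelSlotWarning::Clamped)
    } else {
        None
    };
    ParallelSlotDecision { requested, effective, warning }
}
```

Under these assumptions, `evaluate_parallel_slots(8, Some(4096))` keeps `effective` at 8 and returns the floor warning (512 tokens per slot), while `evaluate_parallel_slots(16, Some(16384))` clamps to 8 and reports `Clamped`. In both cases the value is still applied, matching the "warn, do not block" policy.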
Benchmark methodology¶
The numbers below were collected with the procedure described in this section. Re-run before any future change to the ceiling.
llama-server (one representative model)¶
- Model: unsloth/Qwen3-4B-Instruct-2507-GGUF at Q4_K_M, ~2.5 GB on disk. Chosen because it is one of Backend.AI GO's default recommendations and fits in 16 GB of RAM comfortably with --ctx-size = 8192 and --parallel = 8.
- Host: consumer laptop tier, covering 16 GB Apple Silicon and 32 GB x86_64 Linux with a consumer GPU.
- Procedure: load the model with --ctx-size = 8192, vary --parallel ∈ {1, 2, 4, 8}, fire N overlapping /v1/chat/completions streaming requests where N = --parallel. Record per-request first-token latency and per-request total wall time. Confirm via /health or /slots that all N slots become busy simultaneously.
- Sanity check: at --parallel = 8 with --ctx-size = 4096, per-slot context is 512 tokens, which is below the floor, and the UI warning fires before save. If the user proceeds anyway, the slot eviction rate in llama-server logs visibly spikes, validating that the floor is a meaningful boundary.
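The sweep above can be planned with a small helper like the one below. It is a sketch only: `SweepStep`, `llama_server_args`, and `sweep_plan` are hypothetical names, the model path is a placeholder supplied by the caller, and the code only builds argument lists (using llama-server's --model, --ctx-size, and --parallel flags) without launching anything or sending requests.

```rust
/// One step of the benchmark sweep (hypothetical helper type).
struct SweepStep {
    parallel: u32,
    /// N overlapping requests to fire, where N = --parallel.
    concurrent_requests: u32,
    /// Per-slot context share at this step: ctx_size / parallel.
    per_slot_ctx: u32,
    /// Arguments to pass to llama-server for this step.
    args: Vec<String>,
}

/// Build the llama-server argument list for one step.
fn llama_server_args(model_path: &str, ctx_size: u32, parallel: u32) -> Vec<String> {
    vec![
        "--model".to_string(),
        model_path.to_string(),
        "--ctx-size".to_string(),
        ctx_size.to_string(),
        "--parallel".to_string(),
        parallel.to_string(),
    ]
}

/// Plan the sweep over --parallel in {1, 2, 4, 8} at a fixed --ctx-size.
fn sweep_plan(model_path: &str, ctx_size: u32) -> Vec<SweepStep> {
    [1u32, 2, 4, 8]
        .iter()
        .map(|&parallel| SweepStep {
            parallel,
            concurrent_requests: parallel,
            per_slot_ctx: ctx_size / parallel,
            args: llama_server_args(model_path, ctx_size, parallel),
        })
        .collect()
}
```

Calling `sweep_plan("model.gguf", 8192)` yields four steps whose last entry has 1024 tokens per slot; running it with 4096 instead reproduces the 512-token sanity-check configuration.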
mlxcel-server (continuous batching verification)¶
- Model: lmstudio-community/Qwen3-4B-MLX-bf16 (or equivalent bundled MLX model on Apple Silicon), mlx-bf16.
- Host: Apple Silicon laptop (M-series), macOS.
- Procedure (the load-bearing one: it verifies that continuous batching actually engages rather than merely being asserted):
  - Start mlxcel-server with --parallel 2 and --ctx-size 8192.
  - Send two overlapping /v1/chat/completions streaming requests intentionally close together (e.g., 50 ms apart). Record the first-token timestamp of each request.
  - Pass criterion: the second request's first-token timestamp must be substantially before the first request's completion timestamp (i.e., they overlap), and the second request's first-token latency must be on the order of the first's, not roughly equal to the first's total duration. If the second request's first-token latency is approximately the first request's full completion time, the engine is serializing, not batching.
  - Repeat for --parallel ∈ {4, 8}.
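The pass criterion can be turned into a mechanical check over recorded timestamps. The sketch below is one possible reading of it, not the script used for the finding: `RequestTiming`, `Verdict`, and `classify` are hypothetical names, and `ttft_factor` (how many multiples of the first request's first-token latency still count as "on the order of") is an assumed tuning knob.

```rust
/// Timestamps for one streaming request, in milliseconds on a shared clock.
#[derive(Debug, Clone, Copy)]
struct RequestTiming {
    sent_ms: u64,
    first_token_ms: u64,
    completed_ms: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Second request streamed while the first was still running, with comparable first-token latency.
    Batching,
    /// Second request produced no token until the first had completed.
    Serializing,
    /// Requests overlapped, but the second's first-token latency was much larger than the first's.
    Inconclusive,
}

fn classify(first: &RequestTiming, second: &RequestTiming, ttft_factor: u64) -> Verdict {
    let first_ttft = first.first_token_ms.saturating_sub(first.sent_ms);
    let second_ttft = second.first_token_ms.saturating_sub(second.sent_ms);

    // No overlap at all: the second request waited for the first to finish.
    if second.first_token_ms >= first.completed_ms {
        return Verdict::Serializing;
    }
    // Overlap plus a first-token latency within the assumed factor of the first's.
    if second_ttft <= first_ttft.saturating_mul(ttft_factor) {
        Verdict::Batching
    } else {
        Verdict::Inconclusive
    }
}
```

The `Inconclusive` case is deliberate: an overlap with a very late first token may indicate partial queuing, and is better inspected by hand than forced into a pass or fail.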
The current finding from this procedure on a recent mlxcel-server build is that the flag is accepted at startup and the engine does parallelize at n_parallel ≥ 2 (the second request begins producing tokens before the first finishes). The ceiling of 8 is set deliberately at the value where both engines have been observed to behave; values above 8 may be supported by the engine in isolation, but Backend.AI GO does not currently expose them — the warning logic and a future model-aware extension can revisit this without changing the transport surface.
Hand-off notes for follow-on work¶
Issue #3024 changes the default value of parallel (currently 1) to a higher value to better serve agent workflows out of the box. The chosen ceiling here (8) is the safe upper bound for that bump. #3024 should pick a default of at most 2 to stay at or above the per-slot floor on the smallest realistic configuration: --ctx-size = 2048, common on RAM-constrained laptops, gives 2048 / 2 = 1024 tokens per slot, exactly at the floor. The per-slot floor warning surfaces automatically when #3024's default combined with the user's context size would still trip the floor, so no additional UI work is required on the #3024 side.
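The "at most 2" bound follows from dividing the context size by the floor. A minimal sketch of that calculation, with a hypothetical function name and the constants as stated in this document:

```rust
const PARALLEL_SLOTS_MIN: u32 = 1;
const PARALLEL_SLOTS_MAX: u32 = 8;
const PARALLEL_SLOTS_PER_SLOT_FLOOR: u32 = 1024;

/// Largest slot count in [MIN, MAX] whose per-slot share stays at or above the floor.
/// For context sizes below the floor this returns MIN, which still trips the warning.
fn max_slots_at_or_above_floor(context_size: u32) -> u32 {
    (context_size / PARALLEL_SLOTS_PER_SLOT_FLOOR).clamp(PARALLEL_SLOTS_MIN, PARALLEL_SLOTS_MAX)
}
```

For `--ctx-size = 2048` this gives 2, for 4096 it gives 4, and from 8192 upward it saturates at the ceiling of 8.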
A future iteration may layer a model-aware ceiling on top of the current constant. The contract to preserve is: evaluate_parallel_slots(requested, context_size, ...) returns a ParallelSlotDecision. Adding a model-aware input is an additive change for both transports and the slider's max attribute can be derived from the evaluator's effective ceiling instead of from the static PARALLEL_SLOTS_MAX constant.
References¶
- Issue #3025: investigation thread and acceptance criteria
- src-tauri/src/process/parallel_slots.rs: Rust evaluator (source of truth)
- src-tauri/src/process/types.rs:650-690: apply-time integration into ServerConfig
- src-tauri/src/commands/model_config.rs: evaluate_parallel_slots Tauri command
- src-tauri/src/management_api/handlers/model_config.rs: evaluate_parallel_slots REST handler
- src/types/modelConfig.ts: TypeScript mirror (PARALLEL_SLOTS_MAX, evaluateParallelSlotsLocal)
- src/components/ModelConfigDrawer/ContextTab.tsx: UI slider + per-slot floor warning
- .claude/rules/api-parity.md: Single Source of Truth requirement