7.1. Parallel Request Slots¶

A loaded model serves requests through a fixed number of parallel request slots. The slot count is the Parallel requests setting in the Model Config Drawer's Context tab, and Backend.AI GO forwards it to the inference engine (llama-server or mlxcel-server) as the --parallel flag.

What the setting does¶

Each in-flight request occupies one slot. Requests beyond the slot count wait in the engine's internal queue, where the delay surfaces only as longer first-token latency.

Agent features routinely send several requests to the same loaded model at once: a chat turn that @mentions sub-agents, Cowork sub-agents working in parallel, or multiple Squad agents sharing one local model. With a single slot, all of those calls serialize at the engine even though the application issued them concurrently.

Range and default¶

Bound	Value
Minimum	1
Maximum	8
Default	2
Per-slot context floor (warning)	1,024 tokens

The default of 2 keeps the smallest realistic agent workload, a primary agent plus one sub-agent, from serializing. A value you set explicitly is always preserved, including 1 for deliberately single-slot operation.

The ceiling is a fixed constant, not derived from the loaded model. llama-server itself has no hard upper bound (production server deployments run 8 to 32 slots on server-class hardware), but on the consumer laptops Backend.AI GO targets, values above 8 combined with realistic context sizes mostly produce out-of-memory loads or per-slot contexts below the floor. 8 is also the highest value at which both bundled engines have been verified to serve concurrently.

Context is shared across slots¶

Context Length is a total budget. The engine divides it equally among slots, so each request works with context_size ÷ parallel tokens:

Context Length	Parallel requests	Per-slot context
8192	2	4096
8192	8	1024
4096	8	512 (below the floor, warning shown)

Raising the slot count without raising the context length shrinks every request's working room. Raising both preserves per-slot context but grows KV-cache memory roughly linearly, so a 16 GB machine that comfortably runs a 4B model with an 8K context and 8 slots will not do the same with a 70B model.

Per-slot floor warning¶

When context_size ÷ parallel would drop below 1,024 tokens, a warning appears under the slider before you save. The warning does not block the save. 1,024 is a heuristic, not a physical limit: a typical agent system prompt plus tool listing already takes 600 to 900 tokens, so below 1K a slot has almost no room left for the actual conversation. The engine still accepts tighter values, which remain useful for special workloads such as embedding-style batched inference, but expect constant slot eviction in normal chat or agent use.

Choosing a value¶

Plain chat, one conversation at a time: 1 or 2 slots. There is nothing to parallelize.
Agent workflows (Cowork, Squad, sub-agent mentions): match the slot count to the number of agents that realistically run at once, and scale the context length with it. 4 slots with a 16K context gives each agent 4K of working room.
Memory-constrained machines (8 to 16 GB): prefer fewer slots with a larger per-slot share over many tight slots. Watch for the floor warning.

Verifying concurrency on your hardware¶

You can confirm that slots actually serve in parallel rather than serializing.

llama-server¶

Load a small model (for example Qwen3-4B-Instruct at Q4_K_M) with Context Length 8192 and Parallel requests 4.
Fire 4 overlapping streaming requests against /v1/chat/completions.
Check /health or /slots on the inference port: all 4 slots should report busy simultaneously, and each request's first token should arrive long before the others finish.

mlxcel-server (macOS)¶

Load an MLX model with Parallel requests 2 and Context Length 8192.
Send two streaming requests about 50 ms apart and record each request's first-token timestamp.
The second request should start producing tokens well before the first one completes. If the second request's first token only arrives after roughly the first request's full duration, the engine is serializing instead of batching.