Finding the Concurrency Knee on an L4 GPU

The setup

An agentic coding workload on commodity hardware

The setup is a single g6.4xlarge EC2 instance with one NVIDIA L4 GPU (24 GB VRAM), running Qwen3-4B with FP8 weights on vLLM 0.20.2, automatic prefix caching (APC) enabled.

The workload models a lightweight coding agent mid-task: a 10,000-token shared system prompt (repository context plus agent instructions) followed by a 12-turn conversation where each turn appends ~1,000 tokens of unique context (tool call inputs, code snippets, responses). At turn 12, total context reaches roughly 22,800 tokens per session. Output is capped at 75 tokens per turn. This is heavily input-dominated, as agentic workloads tend to be.

System prompt:          10,000 tokens  (shared across sessions → APC cached)
Per-session turns:      ~12,800 tokens across 12 turns  (unique per session)
Output per turn:        75 tokens
—————————————————
Total at turn 12:       ~22,800 tokens

APC caches the system prompt once and amortizes its KV cost across all sessions (the blocks still occupy GPU memory, but only one copy exists). Within a session, APC also caches the growing turn history: turn n+1 extends the exact prefix from turn n, so each turn only prefills the new ~1,000 tokens if the prior turns' KV blocks survive in cache. But the per-session turn history is unique (different code, different tool outputs) and cannot be shared across sessions. When concurrency pressure forces eviction of a session's blocks, the next request in that session must re-prefill the full accumulated history (up to ~12,800 tokens at turn 12). That re-prefill cost is what drives the TTFT cliff.
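The cost asymmetry described above can be sketched as a small model. This is an illustration, not the benchmark harness: the constants come from the workload description, while the function name `prefill_tokens` and the assumption that context grows evenly across the 12 turns are ours.

```python
# Workload constants from the post. The even per-turn split is an assumption.
SYSTEM_PROMPT_TOKENS = 10_000
SESSION_TOKENS_AT_PEAK = 12_800   # unique per-session context at turn 12
NUM_TURNS = 12
TOKENS_PER_TURN = SESSION_TOKENS_AT_PEAK / NUM_TURNS  # ~1,067


def prefill_tokens(turn: int, history_cached: bool, system_cached: bool = True) -> int:
    """Tokens that must be prefilled for the request at a 1-indexed turn."""
    cost = TOKENS_PER_TURN                       # the new turn is always computed
    if not history_cached:
        cost += (turn - 1) * TOKENS_PER_TURN     # evicted history is recomputed
    if not system_cached:
        cost += SYSTEM_PROMPT_TOKENS             # shared prefix, normally a cache hit
    return round(cost)


for turn in (3, 12):
    hit = prefill_tokens(turn, history_cached=True)
    miss = prefill_tokens(turn, history_cached=False)
    print(f"turn {turn:2d}: warm cache {hit:,} tokens, evicted {miss:,} tokens")
```

Under these assumptions a warm-cache request costs about 1,067 tokens at any turn, while an evicted session at turn 12 costs the full 12,800, roughly a 12× difference in prefill work.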

We swept concurrency from 1 to 48 sessions, measuring TTFT at each level, across three KV cache precisions: fp16, fp8, and turboquant 4-bit (the turboquant run is still pending, so its knee below is a capacity-based estimate). We also ran best-case (all context cached) and worst-case (all context re-prefilled) bounds to bracket where realistic performance should land.

Results

The concurrency cliff

The chart below shows TTFT percentiles (p50, p95, p99) and throughput across the full concurrency sweep with fp8 KV cache. The Y axis is logarithmic. The cliff is sharp.

TTFT p50 / p95 / p99 and throughput (req/s, green dashed, right axis) vs concurrent sessions. fp8 KV cache, Qwen3-4B FP8 on g6.4xlarge (L4). Shaded zone: 12c–16c collapse transition. Dashed line: KV cache fills at 14c.

Up to 12 concurrent sessions, p99 TTFT stays at or below 2.6 seconds and throughput climbs steadily to a peak of 1.36 req/s. At 14 sessions, p99 jumps to 38.9 seconds in a single step, a 15× increase. Throughput drops 23% at the same step.

At 12c, everything fits in KV cache. At 14c, filling the 14th session forces eviction of another session's blocks. That evicted session must re-prefill its full 12,800 tokens of unique context at its next turn, which triggers a cascade that blows up tail latency.

The cliff is between 12c and 14c. Up to 12c, every session fits in KV cache and p99 stays at or below 2.6s. At 14c, p99 explodes to 39s. Throughput peaks at 12c (1.36 req/s) and never recovers. For a 2-second p99 SLA, the safe operating point is 6 concurrent sessions.

After the cliff: p50 lies to you

Above the cliff, p50 and p99 live in completely different regimes. At 20c, p50 is 2.25s (looks manageable) while p99 is 44.8s (catastrophic). This bimodal distribution arises because APC creates two populations of requests:

Lucky requests hit warm cache entries for their session context. They prefill only the latest turn and complete quickly.
Unlucky requests arrive after their session's blocks were evicted. They re-prefill the full 12,800 tokens, taking 30–50 seconds under load.

The p50 reflects the lucky cohort. The p99 reflects the unlucky cohort. A single TTFT average is meaningless past the cliff. You must look at tail latency to see the failure.
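A toy sample makes the point concrete. The sketch below uses a hypothetical post-cliff distribution (90 warm-cache requests at 2.2s, 10 evicted requests at 44.8s; the 90/10 split is an assumption, not a measured value) and a simple nearest-rank percentile helper.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division on integers
    return ordered[rank - 1]


# Hypothetical post-cliff sample: 90 lucky requests, 10 unlucky ones (seconds).
ttft_s = [2.2] * 90 + [44.8] * 10

mean = statistics.mean(ttft_s)
p50 = percentile(ttft_s, 50)
p99 = percentile(ttft_s, 99)

print(f"mean={mean:.2f}s  p50={p50:.1f}s  p99={p99:.1f}s")
```

The mean (about 6.5s) describes neither cohort, and the p50 (2.2s) hides the 44.8s tail entirely. Only the high percentiles expose the evicted sessions.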

Full sweep data

Concurrency	TTFT p50	TTFT p95	TTFT p99	req/s
1	388ms	487ms	493ms	0.42
2	541ms	931ms	965ms	0.68
4	717ms	990ms	1.13s	1.01
6	1.06s	1.43s	1.75s	1.18
8	1.14s	1.73s	2.16s	1.23
10	1.19s	1.89s	2.31s	1.33
12	1.25s	2.16s	2.60s	1.36
14	1.29s	8.93s	38.9s	1.05
16	2.10s	25.4s	35.4s	0.78
18	2.27s	24.3s	50.5s	0.66
20	2.25s	25.5s	44.8s	0.68
24	2.40s	32.5s	45.4s	0.72
28	2.55s	31.5s	34.9s	0.74
32	2.58s	34.1s	49.7s	0.77
40	3.00s	39.2s	50.5s	0.84
48	5.74s	40.5s	43.2s	0.89
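Given sweep data like this, the SLA ceiling and the knee can be pulled out mechanically. The following is a minimal sketch over the p99 and throughput columns of the fp8 table; the helper names and the 5× jump threshold used to define "knee" are our assumptions, not part of the original analysis.

```python
# (concurrency, p99 TTFT in seconds, req/s) copied from the fp8 sweep table.
SWEEP = [
    (1, 0.493, 0.42), (2, 0.965, 0.68), (4, 1.13, 1.01), (6, 1.75, 1.18),
    (8, 2.16, 1.23), (10, 2.31, 1.33), (12, 2.60, 1.36), (14, 38.9, 1.05),
    (16, 35.4, 0.78), (18, 50.5, 0.66), (20, 44.8, 0.68), (24, 45.4, 0.72),
    (28, 34.9, 0.74), (32, 49.7, 0.77), (40, 50.5, 0.84), (48, 43.2, 0.89),
]


def sla_ceiling(sweep, p99_budget_s):
    """Highest tested concurrency where it and every lower level meet the budget."""
    ceiling = None
    for concurrency, p99, _ in sweep:
        if p99 > p99_budget_s:
            break
        ceiling = concurrency
    return ceiling


def find_knee(sweep, jump_factor=5.0):
    """First concurrency whose p99 is at least jump_factor times the previous level's."""
    for (_, prev_p99, _), (concurrency, p99, _) in zip(sweep, sweep[1:]):
        if p99 >= jump_factor * prev_p99:
            return concurrency
    return None


def peak_throughput(sweep):
    """Row with the highest req/s."""
    return max(sweep, key=lambda row: row[2])


print("2s p99 ceiling:", sla_ceiling(SWEEP, 2.0))
print("knee:", find_knee(SWEEP))
print("peak throughput at:", peak_throughput(SWEEP)[0])
```

On this data the 2s ceiling is 6c, the first 5× jump in p99 lands at 14c, and throughput peaks at 12c, matching the numbers quoted in the text.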

Quantization

KV cache precision moves the knee

Running the same workload with 16-bit KV cache (vLLM defaults to the model's dtype, bf16 for Qwen3, when --kv-cache-dtype is not set) halves the token capacity. The knee shifts left proportionally:

KV dtype	Bits/element	Token capacity	Observed knee	2s p99 ceiling
bf16/fp16	16	~89K	~8c	~4c
fp8	8	~178K	~14c	~6c
turboquant 4-bit	~4.2	~275K	~23c (est.)	pending

The fp16→fp8 shift is confirmed: fp16 knees at ~8c, fp8 at ~14c, a 1.75× shift for a 2× capacity increase. The slight compression below 2× is expected: KV management overhead and block table fragmentation consume some of the headroom regardless of precision.

TTFT p99 comparison: fp8 KV (cyan) vs fp16 KV (indigo). Same workload, same GPU. Vertical markers at the observed knees (8c for fp16, 14c for fp8). The tq4 knee at ~23c is a capacity-based estimate (accounting for lower gpu_memory_utilization) pending experiment completion.

Quantization directly buys concurrency headroom. Halving KV precision from fp16 to fp8 nearly doubles how many concurrent sessions fit before the cliff. Turboquant 4-bit (~3.8× fewer bytes per element vs fp16, partially offset by the lower gpu_memory_utilization needed for autotuning scratch space) predicts a knee at ~23c, roughly 3× as many concurrent sessions as fp16.

The accuracy tradeoff may be small. 4-bit KV cache quantization is typically reported to have a low single-digit perplexity impact, though the exact effect depends on model and task. The knee shifts from ~8c (fp16) to ~14c (fp8) to an estimated ~23c (tq4): roughly 3× as many sessions before eviction onset, on the same GPU.

Bounds

Best case, worst case, realistic

To understand how much of the latency budget is fundamental (prefill compute) vs avoidable (cache misses), we ran two controlled bounds alongside the realistic workload:

Best case (miss_rate=0.0): every request hits the same cached content. APC caches the full 12,800-token session context. Only ~200 unique tokens need prefill. This represents perfect KV utilization.
Worst case (miss_rate=1.0): every request gets a unique bust prefix that breaks APC for user context. The 10k system prompt still hits the cache, but all ~12,800 tokens of per-session turn history must be re-prefilled on every request. Crucially, every miss occurs at peak session depth (turn 12), forcing the maximum possible re-prefill cost each time.

TTFT p50: best (fully cached, green), realistic (natural APC, cyan), and worst (always re-prefilled, red). Log scale. The shaded band between best and worst is the envelope where any real workload must land.

What the bounds tell us

At 1 concurrent session with zero contention:

Workload	TTFT p50	What's happening
Best (cached)	57 ms	Only ~200 unique tokens prefilled; rest is cached
Realistic (APC)	388 ms	System prompt and prior turns cached; only the new ~1,000-token turn prefilled
Worst (evicted)	2,500 ms	System prompt cached; ~12,800 user tokens re-prefilled at peak depth every request

On an absolute scale, realistic (388ms) is much closer to best (57ms) than to worst (2,500ms). But realistic is still 7× slower than best. That gap is the cost of prefilling each turn's ~1,000 new tokens of per-session context instead of ~200. APC eliminates the system prompt cost and, while the session's blocks stay cached, the cost of prior turns, but each new turn must still be computed.

The gap between realistic and worst reveals something important about miss depth. In the realistic workload, cache misses can happen at any turn: a session evicted at turn 3 only re-prefills ~3,000 tokens, while eviction at turn 12 costs ~12,800 tokens. The worst case forces every miss to occur at peak session depth, paying the maximum re-prefill cost on every request. Real traffic patterns produce a distribution of miss depths, which is why realistic latency stays much closer to best than to worst.

The best-case result is striking: with perfectly cached session context, the L4 handles >48 concurrent sessions within a 2-second p99 SLA. The ~6c realistic ceiling is not a GPU compute limitation. It is the cost of per-session context uniqueness: unique turn histories that cannot be shared.

The worst case grows linearly at ~2.1 seconds per additional concurrent session, reaching 102 seconds at 48c. Throughput saturates at 0.32 req/s from 6c onward. The GPU is fully consumed re-prefilling 12,800 tokens per request. Additional concurrency just lengthens the queue.
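The linear growth can be written as a one-line queueing approximation. The sketch below is a fit to the two figures quoted above (2.5s at 1c, ~2.1s per extra session), not a model derived from first principles; the function name is ours.

```python
BASE_TTFT_S = 2.5           # worst-case p50 at 1 concurrent session
SLOPE_S_PER_SESSION = 2.1   # reported growth per additional concurrent session


def worst_case_ttft(concurrency: int) -> float:
    """Approximate worst-case TTFT: each extra session adds one more re-prefill to wait behind."""
    return BASE_TTFT_S + SLOPE_S_PER_SESSION * (concurrency - 1)


for c in (1, 6, 24, 48):
    print(f"{c:2d}c -> ~{worst_case_ttft(c):.1f}s")
```

At 48c this gives about 101s, in line with the measured ~102s given the rounding in the slope.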

The math

Why the knee is where it is

The L4 has 24 GB of VRAM, but only about half of it ends up holding KV cache.

Where the memory goes

vLLM's gpu_memory_utilization was set to 0.9 for the fp8 and fp16 experiments, claiming ~21.6 GB. After model weights, CUDA graph capture, activation tensors, and block table overhead, approximately 13 GB remains for KV cache. The turboquant experiment used 0.8 (needs ~2 GB extra scratch for torch.inductor autotuning at startup), leaving approximately 10.6 GB for KV cache.

KV cache per token

Qwen3-4B uses GQA with 36 layers, 8 KV heads, and head_dim 128. The per-token KV cache size depends on precision:

2 (K+V) × 36 layers × 8 KV heads × 128 head_dim × bytes_per_element

FP16/BF16: ... × 2 bytes = 147,456 bytes/token  → ~89K tokens in ~13 GB
FP8:       ... × 1 byte  =  73,728 bytes/token  → ~178K tokens in ~13 GB
TQ4:       ~0.53 B effective (4-bit + quantization metadata)
           =  ~38,700 bytes/token  → ~275K tokens in ~10.6 GB
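The same arithmetic as a short script, for readers who want to plug in another model. The layer, head, and budget figures are from the post; treating GB as 10^9 bytes and using 0.525 bytes per element for TQ4 (back-derived from the ~38,700 bytes/token figure) are assumptions, so the results land within a percent or two of the rounded values above rather than exactly on them.

```python
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128   # Qwen3-4B geometry as stated in the post


def kv_bytes_per_token(bytes_per_element: float) -> float:
    """KV cache bytes per token: K and V, for every layer and KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element


def token_capacity(kv_budget_gb: float, bytes_per_element: float) -> int:
    """Tokens that fit in the KV budget, assuming 1 GB = 1e9 bytes."""
    return int(kv_budget_gb * 1e9 / kv_bytes_per_token(bytes_per_element))


CONFIGS = {
    "fp16/bf16": (13.0, 2.0),
    "fp8": (13.0, 1.0),
    "tq4": (10.6, 0.525),   # effective bytes/element, assumed from ~38,700 B/token
}

for name, (budget_gb, bpe) in CONFIGS.items():
    per_token = kv_bytes_per_token(bpe)
    capacity = token_capacity(budget_gb, bpe)
    print(f"{name:10s} {per_token:>9,.0f} B/token  ~{capacity / 1000:.0f}K tokens")
```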

The capacity arithmetic (with APC)

With APC, the 10,000-token system prompt is stored once and shared. Only the per-session unique context (~12,800 tokens at peak depth) needs its own blocks:

FP8 KV

Token budget:     ~178K
Shared prefix:     10K (1×)
Available:        ~168K
Per-session:      ~12.8K
——————————
Max sessions:   168K / 12.8K ≈ 13
Observed knee:  ~14c

FP16 KV

Token budget:      ~89K
Shared prefix:     10K (1×)
Available:         ~79K
Per-session:      ~12.8K
——————————
Max sessions:   79K / 12.8K ≈ 6
Observed knee:  ~8c

TQ4 KV (estimated)

Token budget:     ~275K (0.8 util)
Shared prefix:     10K (1×)
Available:        ~265K
Per-session:      ~12.8K
——————————
Max sessions:   265K / 12.8K ≈ 21
Predicted knee: ~23c

The arithmetic predicts the knees within 1–2 sessions of the observed values. The slight overshoot (observed 14c vs predicted 13) is because sessions are not all at peak depth simultaneously. Earlier turns have smaller contexts, buying a few extra sessions before capacity is exhausted.

In this setup, the concurrency limit is primarily a memory capacity problem. The binding constraint is how many sessions' KV caches fit in VRAM simultaneously. The best-case bound supports this: with perfect caching, the same GPU handles >48 sessions within 2s p99. Compute, scheduling, and continuous batching effects also contribute, but memory capacity sets the ceiling.

The pattern

How to find the knee for your workload

The knee location depends on three variables:

Available KV cache memory = total VRAM − model weights − CUDA graphs − activations − fragmentation. In this setup that is roughly half of raw VRAM. vLLM reports the exact number at startup.
Per-session unique context = total session tokens at peak depth, minus any shared prefix cached by APC.
KV precision = bytes per element. FP16 uses 2× the bytes of fp8, which uses ~2× the bytes of 4-bit. Each halving of bytes per element roughly doubles token capacity and shifts the knee right.

Max concurrent sessions ≈ (token capacity − shared prefix) / per-session unique context.

For this setup (Qwen3-4B, L4, 22.8K-token agentic sessions with 10K shared prefix), the arithmetic predicts ~13 sessions (fp8) and ~6 (fp16). The observed knees are ~14c and ~8c. The arithmetic gives a first-order estimate. A concurrency sweep gives the precise number. The gap between estimate and observation comes from session depth staggering, block fragmentation, and APC reuse patterns.

Different workloads shift each variable. A single-turn QA workload with 2K tokens per session will have a much higher knee. A code review agent with 50K-token inputs will have a much lower one. A GPU with more VRAM (A100, H100) raises the total budget. But the method is the same: estimate the budget, divide by per-session cost, then verify with a sweep.
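As a sketch of the "estimate, then verify" step, the helper below applies the formula to this post's two measured configurations and to two hypothetical workloads. The function name is ours, and the what-if cases assume the same fp8 token budget (~178K) with no shared prefix; they are illustrations, not measurements.

```python
def estimate_knee(token_capacity: int, shared_prefix_tokens: int, per_session_tokens: int) -> int:
    """First-order session count before eviction: (capacity - shared prefix) / per-session cost."""
    if per_session_tokens <= 0:
        raise ValueError("per_session_tokens must be positive")
    usable = token_capacity - shared_prefix_tokens
    return max(0, usable // per_session_tokens)


# This post's workload: 10K shared prefix, ~12.8K unique tokens per session.
print("fp8  :", estimate_knee(178_000, 10_000, 12_800))
print("fp16 :", estimate_knee(89_000, 10_000, 12_800))

# Hypothetical workloads on the same fp8 budget, no shared prefix assumed.
print("2K single-turn QA :", estimate_knee(178_000, 0, 2_000))
print("50K code review   :", estimate_knee(178_000, 0, 50_000))
```

The estimate is a starting point for choosing the sweep range, since staggered session depth and fragmentation move the observed knee by a session or two in either direction.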

Implications

What this means for deployment

Know your KV budget before you set your concurrency limit. Running below the knee gives you the best throughput with stable latency and effective caching. Running above it gives you worse throughput, worse latency, and wastes the APC investment.

KV quantization is a direct concurrency multiplier. On this L4, switching from fp16 to fp8 KV cache moves the 2s p99 SLA ceiling from ~4c to ~6c (50% more sessions) and the eviction knee from ~8c to ~14c (75% more sessions). The capacity gain is a direct consequence of halving the bytes per KV element. Quantization buys memory, and memory buys concurrency.

Monitor tail latency, not averages. After the cliff, p50 looks manageable while p99 is catastrophic. The bimodal distribution means some users get sub-second responses while others wait 40+ seconds. An average-based dashboard will hide this until users complain.

This is also where the background workload distinction from Post 01 becomes concrete. If some sessions are latency-tolerant background work, they can run without competing for cache memory, freeing KV budget for the interactive sessions that benefit most from low TTFT.

Caveats. These experiments use synthetic token content, not real code. The workload has a fixed 12-turn structure; real agent sessions have highly variable depth. Requests arrive on a Poisson schedule, which does not capture bursty agentic traffic (agents send follow-up requests immediately). p99 at high concurrency is noisy: with ~200 requests per run, p99 is roughly the 2nd-worst request. Chunked prefill (not enabled here) could smooth the knee transition. Results are for a single L4 with Qwen3-4B; larger models, multi-GPU setups, and different context lengths will shift the absolute numbers while the pattern holds.

The series has argued that KV cache is a systems problem. The concurrency knee is where that argument meets a specific GPU, a specific model, and a specific workload shape. The math is simple. The discipline is running it before production tells you the hard way.