2026-05-24 // KV cache series // kvbenchdocs

On Real-time Inference

Agent loops, voice AI, and streaming UX all need the LLM itself to be fast. The techniques converging on sub-200ms inference, and where each one matters.

When the LLM is in the hot path

A single LLM call to a frontier model takes 200-2,000 ms. For batch workloads like classification, extraction, and analysis, that latency is invisible. The job runs overnight and nobody watches the spinner.

But a growing class of workloads puts the LLM in the interactive path. Agent systems that call the model 10-50 times per task. Voice assistants where TTFT maps to conversational fluency. Streaming chat where a 2-second pause feels broken.

The latency problem has three levers:

1. Make the LLM faster (hardware, serving optimizations, speculative decoding).

2. Use a smaller, fine-tuned model that meets the quality bar at lower latency.

3. Call the model less often (tiered inference, routing, cascades).

Production systems often combine all three.

Where milliseconds compound

Agent reasoning loops

Coding agents make 25-34 sequential LLM calls per task. An empirical study of SWE-Bench trajectories found OpenHands averages 29 iterations per issue; RepairAgent averages 34. Manus AI reports 30-50 tool calls on typical tasks. Each call is sequential because step N depends on step N-1's output.

At 2 seconds per call, 25 steps = 50 seconds of pure inference wait. At 200ms per call, the same task takes 5 seconds.
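The arithmetic above generalizes to any step count and per-call latency. A minimal sketch, assuming strictly sequential calls and ignoring tool execution and network time (the function name is illustrative, not from any library):

```python
def loop_inference_seconds(steps: int, per_call_ms: float) -> float:
    """Total inference wait, in seconds, for a strictly sequential agent loop."""
    if steps < 0 or per_call_ms < 0:
        raise ValueError("steps and per_call_ms must be non-negative")
    return steps * per_call_ms / 1000.0


if __name__ == "__main__":
    # Per-call latencies to compare for a 25-step task.
    for per_call_ms in (2000, 500, 200, 50):
        total = loop_inference_seconds(25, per_call_ms)
        print(f"25 steps at {per_call_ms} ms/call -> {total:.2f} s")
```

Because every step waits on the previous one, the total scales linearly with per-call latency. There is no parallelism to hide behind.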

For general-purpose models, smaller means more steps. Anthropic's data shows Opus solves a task in 4 iterations where Sonnet needs 10. But that compares out-of-the-box models on arbitrary tasks. A fine-tuned 3-8B model trained on a specific workflow (code review, support triage, a particular tool-calling pattern) can match a frontier model's step count at a fraction of the per-call latency. The narrower the task, the more a specialized smaller model wins.

Voice and conversational AI

Human conversational turn-taking operates on a 200-300ms window. Exceed it and the interaction feels artificial. AssemblyAI calls 300ms the hard threshold: a 95%-accurate model responding in 300ms beats a 98%-accurate model at 2 seconds.

The voice pipeline budget is tight. STT takes 100-200ms. TTS takes 75-200ms. The LLM sits in the middle and accounts for roughly 70% of total latency. Cutting LLM TTFT below 200ms is the most direct way to hit a sub-800ms end-to-end target.
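To see how much room the LLM actually gets, subtract the other stages from the end-to-end target. This is a rough sketch: the STT and TTS ranges come from the figures above, while `overhead_ms` (network hops, endpointing, audio buffering) is a hypothetical allowance that is not measured anywhere in this article.

```python
def llm_budget_ms(target_ms: float, stt_ms: float, tts_ms: float,
                  overhead_ms: float = 0.0) -> float:
    """Milliseconds left for LLM time-to-first-token after the other stages."""
    return target_ms - stt_ms - tts_ms - overhead_ms


if __name__ == "__main__":
    target = 800  # end-to-end target in ms
    best = llm_budget_ms(target, stt_ms=100, tts_ms=75)
    worst = llm_budget_ms(target, stt_ms=200, tts_ms=200)
    # Hypothetical 200 ms of non-model overhead on top of the slow case.
    worst_with_overhead = llm_budget_ms(target, 200, 200, overhead_ms=200)
    print(f"fast STT/TTS: {best:.0f} ms left for the LLM")
    print(f"slow STT/TTS: {worst:.0f} ms left for the LLM")
    print(f"slow STT/TTS plus overhead: {worst_with_overhead:.0f} ms left")
```

With slow STT and TTS and a modest overhead allowance, the LLM's share shrinks to about 200ms, which is where the sub-200ms TTFT target comes from.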

Each additional second of latency degrades customer satisfaction by 16%. Contact centers report 40% higher hang-up rates when voice agents exceed 1 second. OpenAI's GPT-4o responds to audio in as little as 232 milliseconds, averaging 320ms. That is the benchmark.

Streaming chat and TTFT

Users perceive delays at ~300ms and consciously notice them at ~500ms. Character.AI found faster models drove up to 8.8% improvement in engagement breadth and 19.4% in engagement depth across eight deployments.

The effect is not transient. Google's foundational latency study found that a 200ms delay caused 0.36% fewer searches over six weeks. Even after removing the delay, users took weeks to return to baseline. Latency damage is sticky. Amazon found that every 100ms of added latency can cost 1% in revenue.

Real-time code generation

Code completion already demonstrates the tiered approach: GitHub Copilot uses a small custom model for fast inline suggestions and escalates to a larger model for complex multi-line cases. GitHub's controlled study found Copilot users completed tasks 55% faster (1h11m vs 2h41m), with 73% saying it helped them stay in flow. Google's internal study across 10,000+ engineers showed a 6% reduction in coding iteration time. Both results depend on suggestions arriving within the latency budget. Flow state is fragile.

Agent-mode tasks (code review, refactoring, test generation, multi-file edits) require reasoning across files and make sequential LLM calls, same as the agent loop problem. Today these use larger models, but fine-tuned smaller models are viable for well-scoped tasks. Cursor's "Tab" model (trained specifically for predicting edits) is one example.

| Use case | Latency target | LLM calls per task | Why it needs the LLM |
|---|---|---|---|
| Agent loop | 200ms / call | 25-50 | Multi-step reasoning, tool use |
| Voice assistant | <300ms TTFT | 1 | Intent, context, generation |
| Streaming chat | <500ms TTFT | 1 | Reasoning, long-form output |
| Code agent mode | 200ms / call | 10-30 | Full-file reasoning, edits |

Latency compounds. 25 agent steps, a voice pipeline budget, developer flow state: each millisecond of per-call latency multiplies through the system.

Closing the latency gap

Specialized inference silicon

A new class of hardware bets on specialization: trade flexibility for raw speed. Each takes a different approach to bypassing the memory bandwidth wall that limits GPU inference.

| Vendor | Approach | Speed |
|---|---|---|
| Taalas HC1 | Llama 3.1 8B baked into silicon at fabrication. One model, forever. | 17,000 tok/s* |
| Cerebras | Wafer-scale compute. Full models on a single wafer. | 2,522 tok/s (Maverick); 969 tok/s (405B) |
| Groq | LPU with on-chip SRAM. Deterministic execution. | ~1,300 tok/s (70B) |

* Taalas numbers are vendor-reported. Cerebras is independently verified by Artificial Analysis. Groq has partial independent benchmarks.

For the 25-step agent loop above, cutting per-call latency from 500ms to 50ms turns a 12.5-second workflow into a 1.25-second one. For voice AI, specialized silicon could put the LLM below the 300ms perception threshold.
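Throughput figures like those quoted for specialized silicon can be turned into a rough decode time per call. The sketch below assumes the tok/s number is the output speed a single request sees, and it ignores time to first token, queueing, and network time. The 200-token response length is a made-up example value.

```python
def decode_ms(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second * 1000.0


if __name__ == "__main__":
    # Speeds as quoted in the article; 200 output tokens is hypothetical.
    quoted_speeds = {
        "Taalas HC1 (vendor-reported)": 17_000,
        "Cerebras (Maverick)": 2_522,
        "Groq (70B)": 1_300,
        "Cerebras (405B)": 969,
    }
    for name, tok_per_s in quoted_speeds.items():
        print(f"{name}: {decode_ms(200, tok_per_s):.0f} ms for 200 tokens")
```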

Serving frameworks

NVIDIA Dynamo separates prefill and decode onto different GPU pools, routing requests based on KV cache locality. vLLM and SGLang add prefix caching, chunked prefill, and continuous batching. All three reduce per-request latency by making each LLM call cheaper.
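Routing on KV cache locality can be pictured as sending each request to the worker that already holds the longest matching token prefix. The following is a toy illustration of that idea only. It is not how Dynamo, vLLM, or SGLang implement it, and every name in it is hypothetical. Real routers also weigh worker load and track cache contents at block granularity.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request_tokens: Sequence[int],
                worker_caches: Dict[str, List[List[int]]]) -> str:
    """Return the worker whose cached sequences best match the request prefix.

    Ties go to the first worker in dictionary order.
    """
    if not worker_caches:
        raise ValueError("no workers available")
    best_worker, best_len = "", -1
    for worker, cached in worker_caches.items():
        longest = max(
            (shared_prefix_len(request_tokens, seq) for seq in cached),
            default=0,  # a worker with an empty cache matches nothing
        )
        if longest > best_len:
            best_worker, best_len = worker, longest
    return best_worker
```

The longer the reused prefix, the less prefill work the chosen worker has to redo, which is where the latency saving comes from.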

Speculative decoding

A small draft model proposes tokens; the large model verifies them in a single forward pass. The result is fewer large-model forward passes per output token. Academic benchmarks report 4-6x speedups (EAGLE-3 at temperature 0), but production deployments typically land at 1.5-3x. For voice AI targeting 200ms TTFT, even 2x matters.
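The draft-then-verify loop can be sketched with toy "models" that are plain functions from a token list to the next token. This shows only the greedy (temperature 0) variant, where a drafted token is accepted when it equals the target's own choice. Sampling-based variants use a probabilistic acceptance rule instead. The function names are illustrative, and the verification loop below stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's correction, stop
                break
        else:
            # Every proposal was accepted: take one extra target token if room.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out
```

Every round yields at least one token from the target, so the output never differs from plain greedy decoding. The speedup depends entirely on how often the draft guesses right.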

CPU-side inference for small models

For models under ~4B parameters, GPU dispatch overhead can exceed the compute itself. llama.cpp and ONNX Runtime run guardrail checks (5-30 ms) and classifiers (<1 ms) on CPU, eliminating the GPU roundtrip entirely.

Calling the LLM less often

For many workloads, not every input needs the LLM. Search pipelines cascade through BM25, embedding retrieval, cross-encoder reranking, and the LLM, each stage filtering for the next. Fraud scoring runs gradient-boosted trees at 1ms per transaction, reserving the LLM for ambiguous cases.
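The fraud-scoring pattern reduces to a two-tier cascade: trust the cheap model when it is confident, escalate otherwise. Below is a minimal sketch of that control flow. The `low` and `high` thresholds and both model callables are hypothetical placeholders; in practice the thresholds would be tuned on held-out data.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def cascade(item: T,
            cheap_model: Callable[[T], float],
            expensive_model: Callable[[T], float],
            low: float = 0.1,
            high: float = 0.9) -> Tuple[float, str]:
    """Score with the cheap model; escalate only when its score is ambiguous.

    Returns the final score and the name of the tier that produced it.
    """
    score = cheap_model(item)
    if score <= low or score >= high:
        return score, "cheap"  # confident either way: no LLM call needed
    return expensive_model(item), "expensive"


if __name__ == "__main__":
    # Stand-in scorers; a real system would call a tree model and an LLM.
    cheap = lambda txn: txn["risk_hint"]
    expensive = lambda txn: 0.75
    for txn in ({"risk_hint": 0.02}, {"risk_hint": 0.5}, {"risk_hint": 0.97}):
        print(txn, "->", cascade(txn, cheap, expensive))
```

The latency win comes from how rarely the second branch runs. If most inputs fall outside the ambiguous band, average latency stays close to the cheap tier's.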

"Routing between model tiers" explores how production systems cascade through increasingly expensive models. "When logistic regression outperforms the LLM" covers Google's SIGMOD paper that replaces LLM calls with a linear classifier and gets a 329x latency reduction.