KV Cache for LLM Inference

Separating prefill from decode is turning inference into a distributed systems problem. The KV cache is the object at the center of it.

Prefill and decode have fundamentally different compute profiles. The moment you separate them, KV cache becomes the object that carries state across the boundary. That is the same structural shift that happened when Snowflake separated storage and compute: what was colocated becomes independently addressable, connected by a transfer layer.

This series explores the tradeoffs in caching, transfer, eviction, and capacity planning that come with that shift.

01

Why KV Caching Looks Wrong Until It Suddenly Looks Obvious

The systems-level argument. Why the single-request case is misleading, and why the KV cache object is shrinking faster than most people realize.

02

Disaggregated Prefill and Decode

Prefill and decode have different compute profiles. Separating them turns KV cache from an implementation detail into a distributed systems primitive.

03

vLLM's Hash Chain and Why Prefix Caching Is Still Prefix Caching

How vLLM and SGLang implement prefix caching. The hash-chain technique, the radix tree, why both are prefix-bound, and what CacheBlend could change.
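To make "prefix-bound" concrete, here is a minimal sketch of the hash-chain idea: each full block of tokens is hashed together with the hash of the block before it, so a block's identity depends on everything that precedes it. This illustrates the general technique only, not vLLM's actual code. The names `block_hashes` and `shared_prefix_blocks`, the use of SHA-256, and the tiny block size are all assumptions made for readability.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical value, kept small for illustration


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of tokens.

    Each hash covers the parent block's hash plus this block's tokens,
    so identical tokens at the same position only match if the whole
    prefix before them matches too. A trailing partial block is skipped.
    """
    hashes: List[str] = []
    parent = ""  # the first block has no parent
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: List[str], b: List[str]) -> int:
    """Count how many leading blocks two hash chains have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


if __name__ == "__main__":
    base = block_hashes(list(range(12)))
    same_prefix = block_hashes(list(range(8)) + [99, 98, 97, 96])
    changed_first_token = block_hashes([7] + list(range(1, 12)))
    print(shared_prefix_blocks(base, same_prefix))
    print(shared_prefix_blocks(base, changed_first_token))
```

Changing a single early token invalidates every later block hash, even where the later tokens are identical. That is why a scheme like this can only reuse a shared prefix, not a shared middle.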

04

Realistic Documents for KV Cache Benchmarks

Why a "hi hi hi" prompt gives misleading results. Compression behavior with real weights, bit shuffling, and building a reproducible corpus for offloading experiments.

05

Finding the Concurrency Knee on an L4 GPU

Qwen3-4B FP8 multi-turn workload on a single L4. Where concurrency stops scaling, why it happens at C=16, and how to reason about the memory budget.
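As a rough sketch of what reasoning about the memory budget can look like, the arithmetic below estimates KV cache bytes per token and how many sequences fit in a given budget. It assumes a standard transformer with grouped-query attention and ignores allocator overhead and block fragmentation. The function names and every numeric value in the usage section are illustrative placeholders, not the model's real configuration or measurements from the post.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache stored for one token.

    The factor of 2 accounts for storing both a key and a value
    vector per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes: int, tokens_per_sequence: int,
                             per_token_bytes: int) -> int:
    """How many sequences of a given length fit in the KV budget."""
    return kv_budget_bytes // (tokens_per_sequence * per_token_bytes)


if __name__ == "__main__":
    # Placeholder values chosen for illustration only (assumptions).
    per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8,
                                   head_dim=128, bytes_per_elem=2)
    budget = 8 * 1024**3  # hypothetical memory left for KV after weights
    print(per_token)
    print(max_concurrent_sequences(budget, tokens_per_sequence=4096,
                                   per_token_bytes=per_token))
```

The useful property of this estimate is that the sequence count falls linearly as context length grows, so a multi-turn workload whose conversations keep getting longer will hit the budget at lower concurrency than a single-turn one.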

06

On Real-time Inference

Agent loops, voice AI, and streaming UX need the LLM itself to be fast. Specialized hardware, serving frameworks, and speculative decoding are converging on sub-200ms inference.

07

Routing Between Model Tiers

Search, fraud, code completion, and agent routing already run multi-stage inference cascades. The LLM is the most expensive stage in a pipeline that invokes it as rarely as possible.

08

When Logistic Regression Outperforms the LLM

Google's SIGMOD proxy model paper, 329x latency reduction, and why simpler models sometimes beat their teacher. Plus where small LLMs fill the gap that classifiers cannot.