Prefill and decode have fundamentally different compute profiles. The moment you separate them, KV cache becomes the object that carries state across the boundary. That is the same structural shift that happened when Snowflake separated storage and compute: what was colocated becomes independently addressable, connected by a transfer layer.
Inference is becoming a distributed systems problem. This series explores the tradeoffs in caching, transfer, eviction, and capacity planning that come with that shift.
Why KV Caching Looks Wrong Until It Suddenly Looks Obvious
The systems-level argument. Why the single-request case is misleading, and why the KV cache object is shrinking faster than most people realize.
Disaggregated Prefill and Decode
Prefill and decode have different compute profiles. Separating them turns KV cache from an implementation detail into a distributed systems primitive.
vLLM's Hash Chain and Why Prefix Caching Is Still Prefix Caching
How vLLM and SGLang implement prefix caching. The hash-chain technique, the radix tree, why both are prefix-bound, and what CacheBlend could change.
Realistic Documents for KV Cache Benchmarks
Why "hi hi hi" gives misleading results. Compression behavior with real weights, bit shuffling, and building a reproducible corpus for offloading experiments.
Finding the Concurrency Knee on an L4 GPU
Qwen3-4B FP8 multi-turn workload on a single L4. Where concurrency stops scaling, why it happens at C=16, and how to reason about the memory budget.
On Real-time Inference
Agent loops, voice AI, and streaming UX need the LLM itself to be fast. Specialized hardware, serving frameworks, and speculative decoding are converging on sub-200ms inference.
Routing Between Model Tiers
Search, fraud, code completion, and agent routing already run multi-stage inference cascades. The LLM is the most expensive stage in a pipeline that invokes it as rarely as possible.
When Logistic Regression Outperforms the LLM
Google's SIGMOD proxy model paper, 329x latency reduction, and why simpler models sometimes beat their teacher. Plus where small LLMs fill the gap that classifiers cannot.