The case for always calling the LLM
The post On Real-time Inference covered how to make the LLM faster: specialized hardware, serving optimizations, speculative decoding. This post covers the other lever: calling the LLM less often.
The default architecture for semantic work: send text to an LLM, get a result back. One model, one API, zero training infrastructure. The prompt is the program. Arguably defensible when accuracy matters more than latency or cost.
But the cost structure scales badly.
A single LLM call takes 200-2,000 ms. At $0.50-$15.00 per million input tokens, running one on every row in a 10-million-row table costs thousands of dollars and takes hours. The LLM processes the same number of tokens whether the answer is obvious or ambiguous. In sentiment classification where 90% of examples are unambiguous, each easy row costs the same as a row from the 10% that genuinely need reasoning.
The LLM-for-everything architecture pays the cost of the hardest case on every case. Most real workloads have a bimodal difficulty distribution: a large fraction of easy examples and a smaller fraction that actually need the full model.
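The arithmetic behind those figures can be sketched in a few lines. The workload numbers below (200 input tokens per row, $3.00 per million input tokens, 500 ms per call, 500 calls in flight) are assumptions picked from inside the ranges quoted above, not measurements, and `full_scan_estimate` is a hypothetical helper.

```python
def full_scan_estimate(rows, tokens_per_row, usd_per_million_tokens,
                       seconds_per_call, concurrent_calls):
    """Estimate input-token cost (USD) and wall-clock hours for one LLM call per row."""
    cost_usd = rows * tokens_per_row / 1_000_000 * usd_per_million_tokens
    hours = rows * seconds_per_call / concurrent_calls / 3600
    return cost_usd, hours


# Assumed workload: 10M rows, 200 input tokens per row, $3.00 per million
# input tokens, 500 ms per call, 500 calls in flight at once.
cost_usd, hours = full_scan_estimate(
    rows=10_000_000,
    tokens_per_row=200,
    usd_per_million_tokens=3.00,
    seconds_per_call=0.5,
    concurrent_calls=500,
)
print(f"${cost_usd:,.0f} in input tokens, {hours:.1f} hours")
# prints: $6,000 in input tokens, 2.8 hours
```

Output tokens and retries are left out, so this is a lower bound under the stated assumptions.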
Same pattern as adaptive query execution in databases. If a simpler strategy suffices, use it. In inference, the "simpler strategy" can be an entirely different model.
Five tiers of inference
Models ordered by cost, latency, and capability:
[Figure: five tiers, left (cheapest) to right (most capable): regression, XGBoost, encoder, a (1-8B) model, a (70B+) model]
The question is where a given input should land. Easy inputs (obvious sentiment, clear spam) stay left. Hard inputs (ambiguous categories, reasoning required) escalate right. Try the cheapest model first; if confidence is too low, pass to the next tier.
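A minimal sketch of that escalation loop, assuming each tier exposes a function that returns a label and a confidence score. The `Tier` class, the thresholds, and both stand-in models (`keyword_model`, `llm_stub`) are hypothetical; a real cascade would plug in trained models and calibrated confidences.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    predict: Callable[[str], Tuple[str, float]]  # returns (label, confidence in [0, 1])
    threshold: float  # accept this tier's answer at or above this confidence


def cascade(text: str, tiers: List[Tier]) -> Tuple[str, str]:
    """Return (label, tier_name) from the first tier that is confident enough."""
    for tier in tiers[:-1]:
        label, confidence = tier.predict(text)
        if confidence >= tier.threshold:
            return label, tier.name
    # The last tier is the fallback: its answer is accepted regardless.
    label, _ = tiers[-1].predict(text)
    return label, tiers[-1].name


def keyword_model(text: str) -> Tuple[str, float]:
    # Stand-in for a cheap classifier: confident only on obvious cues.
    words = set(text.lower().split())
    if words & {"great", "love"}:
        return "positive", 0.95
    if words & {"awful", "hate"}:
        return "negative", 0.95
    return "neutral", 0.40  # no strong cue, so low confidence


def llm_stub(text: str) -> Tuple[str, float]:
    # Placeholder for a real LLM call.
    return "negative", 0.90


tiers = [Tier("keywords", keyword_model, 0.9), Tier("llm", llm_stub, 0.0)]
print(cascade("I love this phone", tiers))         # ('positive', 'keywords')
print(cascade("Well, that was something", tiers))  # ('negative', 'llm')
```

The threshold is the tuning knob: raising it sends more inputs to the expensive tier, lowering it trades accuracy for cost.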
Search ranking has worked this way for over a decade: BM25 retrieves candidates, a cross-encoder reranks the top N, and sometimes a larger model re-scores the top K. The LLM is the most expensive stage in a pipeline that should invoke it as rarely as possible.
Cascading filters with increasing cost is a well-understood systems pattern. Adaptive inference applies it to LLM workloads. What is new is that the LLM can generate the training data for the cheaper stages automatically.
Production systems that already do this
Search and recommendation
Meta's recommendation pipeline runs in ~100 ms: a two-tower embedding model retrieves ~10,000 candidates (item embeddings precomputed offline), a lightweight scorer filters to ~500, and a deep ranking model scores the final set. The LLM enters only for query understanding or complex relevance judgments. Cheaper models handle everything else.
Fraud and risk scoring
Stripe Radar runs within a ~100 ms authorization window, with model inference itself taking 1-2 ms. The bottleneck is feature assembly, not the model. Visa evaluates over 500 risk attributes per transaction in approximately one millisecond and credits AI with preventing ~$25 billion in fraud. The models are gradient-boosted trees or neural networks. LLMs enter offline: generating training labels, or analyzing flagged transactions that need deeper reasoning.
Code completion
GitHub Copilot serves over 400 million completion requests per day with a sub-200 ms target. A smaller custom model handles fast single-line suggestions; a larger model handles complex multi-line cases. Cursor's speculative edits push this further: the original file acts as a draft, unchanged chunks are accepted in bulk, and only modified regions need generation, for roughly 1,000 tokens per second of effective throughput on a 70B model.
Agent routing
Agent frameworks face a model selection problem on every step: which tool to call, which model to use, whether to retrieve context. Some routing decisions are simple enough for a classifier. Anthropic routes easy queries to smaller models like Haiku; RouteLLM achieves 95% of GPT-4 quality while sending only 14-26% of queries to the frontier model. The router itself is a proxy: a cheap model that avoids expensive inference when the answer would be the same.
| Domain | Latency budget | Model inference time | What runs in the fast path |
|---|---|---|---|
| Search / rec | ~100 ms | 1-10 ms per stage | Two-tower embeddings, lightweight scorers |
| Fraud | ~100 ms | 1-2 ms | Gradient-boosted trees, neural nets |
| Code completion | ~200 ms | 50-200 ms | Small custom models, speculative edits |
| Agent routing | <50 µs | negligible | Classifiers over cached metadata |
| Content moderation | <50 ms | <1 ms classifier | NLP classifiers; LLM for ambiguous tail |
The common structure: a fast, cheap model handles 85-95% of inputs. The expensive model handles the uncertain residual. The ratio varies by domain, but the architecture is the same.
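To see what that split buys, here is a rough calculation of average latency when every input hits the cheap model and only the unhandled remainder also hits the expensive one. The 1 ms and 1,000 ms figures are illustrative assumptions, and `blended` is a hypothetical helper; the same formula applies to cost.

```python
def blended(cheap: float, expensive: float, handled_fraction: float) -> float:
    """Average per-input cost or latency for a two-stage cascade.

    Every input pays for the cheap model; the fraction it cannot handle
    additionally pays for the expensive model.
    """
    return cheap + (1.0 - handled_fraction) * expensive


# Assumed latencies: 1 ms for the cheap classifier, 1,000 ms for an LLM call.
for p in (0.85, 0.95):
    avg_ms = blended(1.0, 1000.0, p)
    print(f"{p:.0%} handled: {avg_ms:.0f} ms average, "
          f"{1000.0 / avg_ms:.1f}x faster than LLM-only")
# prints:
# 85% handled: 151 ms average, 6.6x faster than LLM-only
# 95% handled: 51 ms average, 19.6x faster than LLM-only
```

The average hides the tail: escalated inputs still pay the full LLM latency on top of the cheap model's.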
How far can you push it?
The cascade pattern is clear: try the cheapest model first, escalate if confidence is low. The question that follows is concrete: for a given task, what is the cheapest tier that produces acceptable quality? And can the LLM itself generate the training data that makes cheaper tiers viable?
A Google team answered both questions for analytical SQL queries, replacing LLM calls with logistic regression over precomputed embeddings and reporting a 329x latency reduction on 10-million-row tables. The proxy sometimes outperforms the LLM on F1. The post When logistic regression outperforms the LLM covers the paper in detail: how it works, where it fails, and why that happens.
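As a rough illustration of the general idea (not the paper's implementation), the sketch below has an LLM label a small sample, fits a logistic regression on embeddings of that sample, and then scores new rows without calling the LLM. Everything here is a toy stand-in: `llm_label` fakes the LLM with a keyword rule, `embed` fakes precomputed embeddings with two word counts, and the training loop is plain SGD written out by hand.

```python
import math


def llm_label(text: str) -> int:
    # Placeholder for a real LLM call; a toy rule stands in for it here.
    return 1 if "refund" in text else 0


def embed(text: str) -> list:
    # Placeholder for precomputed embeddings: two toy word-count features.
    words = text.split()
    return [float(words.count("refund")), float(words.count("thanks"))]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: list, b: float, x: list) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(X: list, y: list, lr: float = 0.5, epochs: int = 500):
    """Fit logistic regression with per-example gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict_proba(w, b, x) - target
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b


# 1. Ask the (stand-in) LLM to label a small sample of rows.
sample = ["refund please", "thanks a lot", "need a refund now",
          "thanks for the help", "refund refund", "ok thanks"]
labels = [llm_label(row) for row in sample]

# 2. Train the cheap proxy on embeddings of that sample.
w, b = train_logreg([embed(row) for row in sample], labels)

# 3. Score unseen rows with the proxy only; no LLM call is made here.
print(predict_proba(w, b, embed("where is my refund")) > 0.5)  # True
print(predict_proba(w, b, embed("many thanks")) > 0.5)         # False
```

In practice the proxy's quality depends on how well the embeddings separate the classes, which is where a confidence threshold and an LLM fallback come back in.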