Routing Between Model Tiers

The default assumption

The case for always calling the LLM

"On Real-time Inference" covered how to make the LLM faster: specialized hardware, serving optimizations, speculative decoding. This post covers the other lever: calling the LLM less often.

The default architecture for semantic work is simple: send text to an LLM, get a result back. One model, one API, zero training infrastructure. The prompt is the program. That is defensible when accuracy matters more than latency or cost.

But the cost structure scales badly.

A single LLM call takes 200-2,000 ms. At $0.50-$15.00 per million input tokens, running one call per row over a 10-million-row table costs thousands of dollars and takes hours. The LLM processes the same number of tokens whether the answer is obvious or ambiguous. In sentiment classification, the 90% of examples that are unambiguous cost the same per row as the 10% that genuinely need reasoning.
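The arithmetic behind that claim can be sketched in a few lines. The 200 input tokens per row and the 100 concurrent requests are assumptions chosen for illustration, not figures from this post; only the price and latency ranges come from the text above.

```python
def batch_cost_and_time(rows, tokens_per_row, usd_per_million_tokens,
                        seconds_per_call, concurrency):
    """Return (dollars, hours) for calling the LLM once per row."""
    total_tokens = rows * tokens_per_row
    dollars = total_tokens / 1_000_000 * usd_per_million_tokens
    # Wall-clock time assumes `concurrency` requests are always in flight.
    hours = rows * seconds_per_call / concurrency / 3600
    return dollars, hours

# Assumed workload: 10M rows, 200 input tokens each, 100 concurrent requests.
low = batch_cost_and_time(10_000_000, 200, 0.50, 0.2, 100)
high = batch_cost_and_time(10_000_000, 200, 15.00, 2.0, 100)
print(f"low end:  ${low[0]:,.0f}, {low[1]:.1f} h")
print(f"high end: ${high[0]:,.0f}, {high[1]:.1f} h")
```

Under these assumptions the bill lands between $1,000 and $30,000, and the job takes roughly 5.6 to 56 hours. Output tokens, retries, and rate limits would push both numbers higher.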

The LLM-for-everything architecture pays the cost of the hardest case on every case. Most real workloads have a bimodal difficulty distribution: a large fraction of easy examples and a smaller fraction that actually need the full model.

This is the same pattern as adaptive query execution in databases: if a simpler strategy suffices, use it. In inference, the "simpler strategy" can be an entirely different model.

The spectrum

Five tiers of inference

Models ordered by cost, latency, and capability:

Cheapest → Logistic regression (µs / row) → SVM / XGBoost (µs / row) → Cross-encoder (ms / row) → Small LLM, 1-8B (10s of ms / row) → Full LLM, 70B+ (100s of ms / row) → Costliest

The question is where a given input should land. Easy inputs (obvious sentiment, clear spam) stay left. Hard inputs (ambiguous categories, reasoning required) escalate right. Try the cheapest model first; if confidence is too low, pass to the next tier.
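A minimal sketch of that escalation rule follows. The tier functions, names, and thresholds are hypothetical stand-ins; a real cascade would wrap trained models and calibrate each threshold on held-out data.

```python
from typing import Callable, List, Tuple

# Each tier maps an input to (label, confidence in [0, 1]).
Tier = Callable[[str], Tuple[str, float]]

def cascade(text: str, tiers: List[Tuple[str, Tier, float]]) -> Tuple[str, str]:
    """Try tiers cheapest-first and return (label, tier_name) from the first
    tier whose confidence meets its threshold. The last tier always answers,
    so `tiers` must contain at least one entry."""
    for name, model, threshold in tiers[:-1]:
        label, confidence = model(text)
        if confidence >= threshold:
            return label, name
    name, model, _ = tiers[-1]
    label, _ = model(text)
    return label, name

# Hypothetical cheap tier: a keyword rule standing in for a small classifier.
def keyword_model(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    if "love" in lowered:
        return "positive", 0.97
    if "hate" in lowered:
        return "negative", 0.97
    return "neutral", 0.40  # low confidence, so the input escalates

# Hypothetical expensive tier: a placeholder for a full LLM call.
def full_llm(text: str) -> Tuple[str, float]:
    return "mixed", 1.0

tiers = [("keyword", keyword_model, 0.9), ("llm", full_llm, 0.0)]
print(cascade("I love it", tiers))
print(cascade("Well, it arrived.", tiers))
```

The first input is resolved by the cheap tier; the second falls below the threshold and is answered by the placeholder LLM. The threshold is the main tuning knob: raising it trades cost for fewer cheap-tier mistakes.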

Search ranking has worked this way for over a decade: BM25 retrieves candidates, a cross-encoder reranks the top N, and sometimes a larger model re-scores the top K. The LLM is the most expensive stage in a pipeline that should invoke it as rarely as possible.
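The retrieve-then-rerank funnel differs from a confidence cascade: each stage narrows the candidate set instead of deciding whether to answer. The sketch below uses toy word-overlap scorers as stand-ins for BM25, a cross-encoder, and a larger model; all function names and the corpus are illustrative assumptions.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def top_k(query: str, docs: List[str], scorer: Scorer, k: int) -> List[str]:
    """Keep the k highest-scoring documents under the given scorer."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:k]

def funnel(query: str, corpus: List[str], cheap: Scorer, mid: Scorer,
           expensive: Scorer, n: int, k: int) -> List[str]:
    """Cheap scorer sees the whole corpus, mid scorer sees the top n,
    expensive scorer sees only the top k."""
    candidates = top_k(query, corpus, cheap, n)
    shortlist = top_k(query, candidates, mid, k)
    return top_k(query, shortlist, expensive, len(shortlist))

# Toy stand-ins for the three stages (not real BM25 or cross-encoders).
def overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))

def overlap_with_length_penalty(query: str, doc: str) -> float:
    return overlap(query, doc) - 0.01 * len(doc.split())

expensive_calls: List[str] = []  # records every document the costly stage scores

def expensive_scorer(query: str, doc: str) -> float:
    expensive_calls.append(doc)
    return overlap(query, doc) / len(doc.split())

corpus = [
    "cheap flights to paris",
    "paris travel guide for cheap hotels and flights and food",
    "how to bake bread",
    "flights",
    "cheap cheap cheap",
]
ranked = funnel("cheap flights paris", corpus, overlap,
                overlap_with_length_penalty, expensive_scorer, n=3, k=2)
print(ranked, len(expensive_calls))
```

With five documents, n=3, and k=2, the expensive scorer runs on only two of them. The same ratio at production scale is what keeps the costliest stage affordable.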

Cascading filters with increasing cost is a well-understood systems pattern. Adaptive inference applies it to LLM workloads. What is new is that the LLM can generate the training data for the cheaper stages automatically.
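To make the last point concrete, here is a small sketch in which an LLM labels a sample once and a logistic regression is trained on those labels. `llm_label` is a hypothetical stub standing in for a real LLM call, and the bag-of-words features are a toy replacement for real embeddings; the training loop is plain stochastic gradient descent written out so the example needs no external libraries.

```python
import math
from typing import List, Tuple

VOCAB = ["free", "winner", "meeting", "report", "prize", "lunch"]

def llm_label(text: str) -> int:
    """Hypothetical stub for an LLM call: 1 = spam, 0 = not spam."""
    words = text.split()
    return int("free" in words or "winner" in words)

def featurize(text: str) -> List[float]:
    """Toy bag-of-words features; a real system would use embeddings."""
    words = set(text.split())
    return [1.0 if term in words else 0.0 for term in VOCAB]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs: List[List[float]], ys: List[int],
                 epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression with per-example gradient steps."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            grad = p - y  # derivative of log loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Step 1: the expensive model labels a small sample once.
sample = ["free prize inside", "you are a winner", "meeting at noon",
          "quarterly report attached", "free lunch for the winner",
          "lunch meeting tomorrow"]
labels = [llm_label(t) for t in sample]

# Step 2: the cheap model learns from those labels and serves the rest.
weights, bias = train_logreg([featurize(t) for t in sample], labels)
print(predict_proba(weights, bias, featurize("free winner today")))
```

After training, the cheap model scores new rows without calling the LLM. Its quality is bounded by the quality of the LLM's labels, so the labelled sample is worth auditing before it is used this way.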

Evidence

Production systems that already do this

Search and recommendation

Meta's recommendation pipeline runs in ~100 ms: a two-tower embedding model retrieves ~10,000 candidates (item embeddings precomputed offline), a lightweight scorer filters to ~500, and a deep ranking model scores the final set. The LLM enters only for query understanding or complex relevance judgments. Cheaper models handle everything else.

Fraud and risk scoring

Stripe Radar runs within a ~100 ms authorization window, with model inference itself taking 1-2 ms. The bottleneck is feature assembly, not the model. Visa scores each transaction in approximately one millisecond, evaluating over 500 risk attributes, and credits its AI models with preventing ~$25 billion in fraud. The models are gradient-boosted trees or neural networks. LLMs enter offline: generating training labels, or analyzing flagged transactions that need deeper reasoning.

Code completion

GitHub Copilot serves over 400 million completion requests per day with a sub-200 ms target. A smaller custom model handles fast single-line suggestions; a larger model handles complex multi-line cases. Cursor's speculative edits push this further: the original file acts as a draft, unchanged chunks are accepted in bulk, and only modified regions need generation -- roughly 1,000 tokens per second effective throughput on a 70B model.

Agent routing

Agent frameworks face a model selection problem on every step: which tool to call, which model to use, whether to retrieve context. Some routing decisions are simple enough for a classifier. Anthropic routes easy queries to smaller models like Haiku; RouteLLM achieves 95% of GPT-4 quality while sending only 14-26% of queries to the frontier model. The router itself is a proxy: a cheap model that avoids expensive inference when the answer would be the same.

Domain	Latency budget	Model inference	What runs in the fast path
Search / rec	~100 ms	1-10 ms per stage	Two-tower embeddings, lightweight scorers
Fraud	~100 ms	1-2 ms	Gradient-boosted trees, neural nets
Code completion	~200 ms	50-200 ms	Small custom models, speculative edits
Agent routing	<50 µs	negligible	Classifiers over cached metadata
Content moderation	<50 ms	<1 ms classifier	NLP classifiers; LLM for ambiguous tail

The common structure: a fast, cheap model handles 85-95% of inputs. The expensive model handles the uncertain residual. The ratio varies by domain, but the architecture is the same.
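A quick way to see what that split buys is to compute the blended cost per input. The per-input costs below are assumed round numbers chosen for illustration (the cheap model at 1% of the LLM's cost), not measurements from any of the systems above.

```python
def blended_cost(cheap_cost: float, llm_cost: float, cheap_share: float) -> float:
    """Expected per-input cost when every input hits the cheap model and
    the share it cannot resolve is escalated to the LLM."""
    return cheap_cost + (1.0 - cheap_share) * llm_cost

# Assumed costs: $0.001 per LLM call, $0.00001 per cheap-model call.
for share in (0.85, 0.95):
    cost = blended_cost(cheap_cost=0.00001, llm_cost=0.001, cheap_share=share)
    print(f"{share:.0%} handled cheaply -> {cost / 0.001:.1%} of LLM-only cost")
```

Under these assumptions the pipeline costs about 16% of the LLM-only baseline at an 85% cheap share and about 6% at 95%. The residual share dominates the bill, which is why small gains in cheap-tier coverage matter.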

The question

How far can you push it?

The cascade pattern is clear: try the cheapest model first, escalate if confidence is low. The question that follows is concrete: for a given task, what is the cheapest tier that produces acceptable quality? And can the LLM itself generate the training data that makes cheaper tiers viable?

A Google team answered both questions for analytical SQL queries, replacing LLM calls with logistic regression over precomputed embeddings. The result: a 329x latency reduction on 10-million-row tables, with the proxy sometimes outperforming the LLM on F1. "When logistic regression outperforms the LLM" covers the paper in detail: how it works, where it fails, and why that happens.