When Logistic Regression Outperforms the LLM

The paper

Logistic regression instead of an LLM

Routing between model tiers covered the pattern: production systems route inputs to the cheapest model that handles them. 100x Faster, Cheaper, and Better (Chung et al., SIGMOD 2026) takes this to its logical extreme. The cheapest model that works is not a smaller LLM. It is logistic regression.

The system runs in BigQuery and AlloyDB. It handles three SQL operators that require semantic understanding: AI.IF (binary classification), AI.CLASSIFY (multiclass), and AI.RANK (semantic ranking). Consider a query like:

SELECT * FROM reviews WHERE AI.IF(review, 'is positive') = TRUE

The pipeline is fully automated: it samples 1,000 rows, labels them with Gemini, trains logistic regression over precomputed text embeddings, and evaluates quality on a held-out split. If the proxy meets a quality threshold, it handles the remaining rows. Otherwise, the system falls back to the LLM. No ML expertise required. No hyperparameter tuning. Default sklearn parameters.
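The sample, label, train, evaluate, decide loop can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: `teacher_label` is a hypothetical stand-in for the Gemini call, the "embeddings" are synthetic 4-dimensional vectors, and a hand-rolled gradient-descent logistic regression is swapped in for sklearn so the block has no dependencies. The 0.90 threshold and the use of held-out agreement with the teacher as the quality measure are assumptions.

```python
import math
import random


def teacher_label(embedding):
    """Hypothetical stand-in for the LLM labeler (Gemini in the paper)."""
    return 1 if embedding[0] + 0.5 * embedding[1] > 0 else 0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=100, lr=0.5):
    """Batch gradient descent on log loss (swapped in for sklearn's solver)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, embedding):
    """Label 1 when the linear score is non-negative (probability >= 0.5)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, embedding)) + b >= 0 else 0


def build_proxy(embeddings, sample_size=1000, holdout_frac=0.2, threshold=0.90, seed=0):
    """Sample, label with the teacher, train, then score on a held-out split."""
    rng = random.Random(seed)
    sample = rng.sample(embeddings, min(sample_size, len(embeddings)))
    labels = [teacher_label(e) for e in sample]
    cut = int(len(sample) * (1 - holdout_frac))
    w, b = train_logreg(sample[:cut], labels[:cut])
    holdout = list(zip(sample[cut:], labels[cut:]))
    agreement = sum(predict(w, b, e) == l for e, l in holdout) / len(holdout)
    return (w, b), agreement, agreement >= threshold


# Synthetic 4-dimensional "embeddings"; real ones would come from storage.
data_rng = random.Random(42)
table = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(5000)]
model, agreement, use_proxy = build_proxy(table)
print("held-out agreement:", round(agreement, 3))
print("executor:", "proxy" if use_proxy else "LLM fallback")
```

The only LLM calls are the ones inside `build_proxy`. Once the proxy passes the check, every remaining row goes through `predict`.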

Training takes under a minute on 1,000 samples. At 10M rows, those 1,000 labeled rows are 0.01% of the table. Everything after the first thousand rows runs as a matrix multiply on CPU.
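The amortization is simple arithmetic. The sketch below counts rows only (it says nothing about wall-clock time) and assumes the fixed 1,000-row sample described above; `labeled_fraction` is an illustrative name.

```python
def labeled_fraction(table_rows, sample_size=1000):
    """Share of the table that has to go through the LLM for labeling."""
    return min(sample_size, table_rows) / table_rows


for rows in (100_000, 10_000_000):
    print(f"{rows:>10,} rows: {labeled_fraction(rows):.2%} labeled by the LLM")
```

At 100K rows the labeled sample is 1% of the table; at 10M rows it is 0.01%. The same fixed cost is spread over 100x more rows.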

The numbers

329x latency, 728x cost

Scenario	Latency reduction	Cost reduction	Condition
10M rows, precomputed embeddings (online)	329x	728x	Embeddings already in storage
10M rows, precomputed embeddings (offline)	991x	792x	Pre-trained proxy, AlloyDB
10M rows, embeddings not precomputed	~5x	~2.5x	Must compute embeddings at query time
100K rows, precomputed	6x	81x	Training overhead less amortized

The headline numbers require precomputed embeddings. When embeddings are already stored alongside the data, the proxy at query time is logistic regression over vectors already in memory. No tokenization, no attention, no GPU.
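To make "logistic regression over vectors already in memory" concrete, here is a minimal sketch of the query-time path. The weights, bias, and two-dimensional embeddings are made-up toy values; a real proxy would use the trained coefficients and the stored embedding column.

```python
import math


def proxy_predict(embeddings, weights, bias, cutoff=0.5):
    """Score each stored embedding with sigmoid(w . x + b) and threshold it."""
    results = []
    for x in embeddings:
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        prob = 1.0 / (1.0 + math.exp(-z))
        results.append((prob >= cutoff, prob))
    return results


# Toy values for illustration only.
stored_embeddings = [[1.0, 0.0], [0.0, 1.0]]
for label, prob in proxy_predict(stored_embeddings, weights=[2.0, -1.0], bias=0.0):
    print(label, round(prob, 3))
```

Per row, the work is one dot product and one sigmoid. Stacked over all rows, it is a single matrix-vector multiply.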

Without precomputed embeddings, the advantage drops to ~5x latency and ~2.5x cost. Still meaningful, but a different category of win. The gap between 5x and 329x is the difference between "embeddings as a runtime step" and "embeddings as data infrastructure."

Latency speedup (proxy vs LLM) by table size. Green: online with precomputed embeddings. Purple dashed: offline batch. Yellow dashed: without precomputed embeddings. Log scale on y-axis. Data from Chung et al. SIGMOD 2026, Tables 6-7.

The speed comes from avoiding the LLM call entirely. The proxy replaces the LLM with a model that runs on CPU in microseconds per row. Hardware accelerators, KV caching, batching optimizations: none of them close a 329x gap. The gap exists because one path never calls the LLM.

The surprise

The proxy sometimes beats the teacher

Across 11 benchmark datasets for AI.IF, the proxy achieves F1 ratios (proxy F1 / LLM F1) between 0.901 and 1.163. On 7 of 13 benchmark-prompt pairs, the proxy matches or outperforms Gemini. On Amazon Reviews 10K, the proxy hits 0.860 F1 against the LLM's 0.739: a 16.3% improvement over its own teacher.

F1 ratio (proxy F1 / LLM F1) across benchmark-prompt pairs for AI.IF. Sorted by ratio. Points above 1.0 = proxy outperforms the LLM. Data from Chung et al. SIGMOD 2026, Table 5.

The most striking result is in the data-slice analysis. When the paper tests a globally-trained proxy on filtered subsets of California Housing data, the proxy outperforms Gemini by 2.44x on one slice (Slice 3: proxy F1 0.244 vs LLM F1 0.100) and 1.85x on another (Slice 1: 0.617 vs 0.334). The globally-trained proxy generalizes to data subsets better than the LLM handles them directly.

F1 scores by data slice on California Housing. The proxy is trained on a global sample, then evaluated on filtered subsets. On 4 of 5 slices, the proxy outperforms the LLM. Data from Chung et al. SIGMOD 2026, Table 14.
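The ratios quoted above follow directly from the reported F1 values. A small sketch, using only the numbers in the text; `f1_score` is included to show what is being divided. Ratios computed from rounded F1 values can differ from the paper's in the last digit.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_ratio(proxy_f1, llm_f1):
    """Values above 1.0 mean the proxy beat its teacher."""
    return proxy_f1 / llm_f1


# (proxy F1, LLM F1) as quoted in the text above.
reported = {
    "Amazon Reviews 10K": (0.860, 0.739),
    "California Housing, Slice 1": (0.617, 0.334),
    "California Housing, Slice 3": (0.244, 0.100),
}

for name, (proxy_f1, llm_f1) in reported.items():
    print(f"{name}: {f1_ratio(proxy_f1, llm_f1):.2f}x")
```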

Why does this happen?

Three mechanisms compound.

The embedding does the hard work. The proxy's input is a precomputed embedding vector. The embedding model has already projected text into a space optimized for semantic similarity, discarding syntactic noise and formatting artifacts. The classifier operates on a cleaner representation than the LLM ever sees. This is the Information Bottleneck idea in practice: compress away everything except what matters for the task. The LLM gets no such filter: Shi et al. (ICML 2023) showed that LLMs are easily distracted by irrelevant context, with accuracy dropping even on trivial tasks when extraneous information is present.

Distillation regularizes. The proxy trains on LLM-generated labels, learning the central tendency of the LLM's decision function while discarding per-instance noise. Furlanello et al. (ICML 2018) showed that even self-distillation, training an identical architecture on its own outputs, consistently outperforms the original.

Determinism compounds. Logistic regression returns the same label for the same input every time. LLMs do not: batch composition, MoE expert routing, and floating-point reduction order create variance even at temperature 0 (Atil et al., 2024). Over thousands of examples, determinism alone removes a source of run-to-run noise from the F1 score. And the proxy has no output format to get wrong: no "The sentiment is positive" when the parser expects "positive."

The LLM's value shifts from running inference to generating the training labels that make the classifier possible. The LLM is the teacher. It does not need to be the one answering every question.

Design decisions

What makes the engineering compelling

Hyperparameter tuning does not help

The paper's ablation on model selection is remarkable. On Tweet Sentiment, logistic regression with default sklearn parameters (F1: 0.867) matches or beats tuned Random Forest (0.836), XGBoost (0.848), and SVM (0.861). SVM costs 49.84x more training time. XGBoost costs 3.18x. Neither improves accuracy.

F1 and relative training cost on Tweet Sentiment. Logistic regression with default parameters beats all alternatives. Tuning the alternatives does not help either: RF tuned drops to 0.836, XGBoost tuned drops to 0.848. Data from Chung et al. SIGMOD 2026, Table 13.

The reason is simple: with high-quality embeddings, the signal is already in the representation. Additional model complexity adds training cost without improving accuracy. The paper says so explicitly: "tuning did not result in significant differences."

The embedding model matters more than the classifier

The paper tests three embedding models: Gecko (768D, 18x base cost), Gemini (3072D, 28x base cost), and Gemma (768D, 1x base cost). Gecko at 768 dimensions consistently outperforms Gemini at 3072 dimensions, despite Gemini costing about 56% more (28x vs 18x) and producing 4x larger vectors. More dimensions do not mean better embeddings. The quality of the projection matters more than its size.
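Dimension count also sets the storage bill for precomputed embeddings. A back-of-the-envelope sketch, assuming float32 vectors (4 bytes per value) and a 10M-row table; the storage format is an assumption, while the dimensions and relative costs are the ones listed above.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32


def embedding_storage_gb(rows, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return rows * dims * bytes_per_value / 1e9


# name -> (dimensions, cost relative to the base model), from the text above.
models = {"Gecko": (768, 18), "Gemini": (3072, 28), "Gemma": (768, 1)}

gecko_cost = models["Gecko"][1]
for name, (dims, cost) in models.items():
    size = embedding_storage_gb(10_000_000, dims)
    print(f"{name}: {size:.2f} GB at 10M rows, {cost / gecko_cost:.2f}x Gecko's cost")
```

Under these assumptions the 3072-dimensional vectors take about 123 GB at 10M rows against about 31 GB for either 768-dimensional model.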

Gemma (open-source, cheapest) is consistently too weak for reliable proxies. There is a minimum quality threshold below which the proxy approach breaks down. That threshold is well above "cheapest available."

Random sampling works

The paper tests random sampling, Top-K similarity, and Active Learning. Active Learning costs 51,639x more than random sampling. Top-K costs 43,761x more. For balanced datasets, random sampling produces comparable training sets at a fraction of the cost. Active Learning is only justified for extreme class imbalance (ratio >10:1), and even then, the paper's default approach of class-weighted training handles most cases.
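Class-weighted training is cheap to picture. The sketch below uses the inverse-frequency heuristic that scikit-learn applies for class_weight="balanced", n_samples / (n_classes * class_count). Whether the paper uses exactly this scheme is an assumption; the 10:1 label split is a made-up example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical 10:1 imbalance: 1,000 negatives, 100 positives.
labels = [0] * 1000 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)
```

Each positive example counts ten times as much as a negative one in the loss, so the rare class still shapes the decision boundary without any extra sampling cost.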

Beyond classification

AI.RANK and AI.CLASSIFY

Ranking: where the proxy hits its limits

For AI.RANK, the proxy formulates ranking as binary classification (relevant vs irrelevant) over embeddings. On TREC-COVID (493 relevant docs per query), the proxy achieves 0.535 nDCG@10 against the LLM's 0.551: 97% relative. On TREC-DL-2022 (4-level rubric scoring), it hits 0.446 vs the LLM's 0.537: 83% relative.

But on sparse-relevance datasets, it fails. HellaSwag (1 relevant per 4 options): proxy 0.134 vs LLM 0.247. SciFact (1.1 relevant per query): proxy 0.010 vs LLM 0.508. The training set becomes too imbalanced for the classifier to learn. These cases trigger the automatic fallback.
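For readers who want the metric spelled out, here is a minimal nDCG@k sketch. It ranks candidates by a score (for the proxy, the predicted probability of "relevant") and uses linear gain with a log2 discount. The gain convention is an assumption; some evaluations use 2^rel - 1 instead.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k=10):
    """Rank items by score, then normalize DCG by the best achievable DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0


# One relevant document among three candidates (toy example).
relevances = [1, 0, 0]
print(ndcg_at_k([0.9, 0.1, 0.5], relevances))  # relevant doc ranked first
print(ndcg_at_k([0.1, 0.9, 0.5], relevances))  # relevant doc ranked last
```

With a single relevant item per query, one misplaced document moves the score a long way, which is part of why sparse-relevance datasets are unforgiving.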

Cross-attention re-rankers beat both the proxy and the LLM on most IR tasks, at 0.025x the cost of the LLM and 0.11x the latency. The paper suggests integrating re-rankers into the adaptive selection framework as a future direction.

Multiclass: more classes, more samples

For AI.CLASSIFY, the proxy extends to multiclass. BBC News (5 classes, 2.2K rows) achieves 0.95 precision/recall with 1,000 training samples. DBpedia (14 classes) needs 4,000+ samples to reach 0.94/0.95, up from 0.83/0.65 at 1,000 samples. The paper calls this the "needle-in-a-haystack" sampling problem: the more classes there are, the more samples are needed to cover each one adequately.
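A rough way to see the coverage problem: count the expected labeled examples per class for a random sample. This sketch assumes classes are equally frequent, which is a simplification. Skewed classes make the rarest one even thinner.

```python
def expected_per_class(sample_size, num_classes):
    """Expected labeled examples per class under a uniform class distribution."""
    return sample_size / num_classes


for name, sample_size, num_classes in [
    ("BBC News", 1000, 5),
    ("DBpedia", 1000, 14),
    ("DBpedia", 4000, 14),
]:
    per_class = expected_per_class(sample_size, num_classes)
    print(f"{name}: {sample_size} samples / {num_classes} classes = {per_class:.1f} per class")
```

Under that assumption, 1,000 samples give about 200 examples per class with 5 classes but only about 71 with 14. Going to 4,000 samples brings the 14-class case to about 286 per class.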

The middle tier

Where small LLMs fill the gap

Some tasks fall between a linear classifier and a frontier model: too complex for logistic regression, too latency-sensitive for 70B+ parameters. Small LLMs (1-8B, fine-tuned) fill that gap. A classifier outputs a label. A small LLM can generate structured output: explanations, extracted entities, reformulated queries, tool-call decisions.

The distillation dynamic from the Chung et al. paper applies here too. A fine-tuned 1-3B model inherits the frontier model's decision boundary while gaining focus. Wang, Qu & Ye (2024) showed fine-tuned BERT with 200 training samples (71.1% accuracy) matches few-shot GPT (70.2%), with the gap widening with more data. Bucher & Martini (2024) found fine-tuned RoBERTa and DeBERTa beat zero-shot GPT-4 on classification benchmarks.

The practical cascade

Tier 1: Embedding classifier (logistic regression, SVM). Binary and multiclass classification where the decision boundary is clean. Microseconds per input, runs on CPU. Catches 60-80% of requests.
Tier 2: Small LLM (1-8B, fine-tuned). Structured generation, multi-field extraction, nuanced classification. Tens of milliseconds per input.
Tier 3: Frontier LLM (70B+). Open-ended reasoning, ambiguous edge cases, tasks requiring deep context. Hundreds of milliseconds per input. Receives only the hard tail.

Each tier is trained or distilled from the tier above it. Quality flows downward through distillation; traffic flows upward through confidence-based escalation.
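Confidence-based escalation can be sketched as a loop over tiers. Everything here is hypothetical: the three model functions are toy stand-ins that return a label and a confidence, and the thresholds (0.90 and 0.80) are illustrative, not tuned values.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence)


def embedding_classifier(text: str) -> Tuple[str, float]:
    """Toy tier 1: confident only on obviously polar wording."""
    if "great" in text:
        return "positive", 0.97
    if "terrible" in text:
        return "negative", 0.96
    return "positive", 0.55


def small_llm(text: str) -> Tuple[str, float]:
    """Toy tier 2: unsure when the text contains a contrast."""
    if "but" in text:
        return "mixed", 0.60
    return "neutral", 0.90


def frontier_llm(text: str) -> Tuple[str, float]:
    """Toy tier 3: the backstop, always answers."""
    return "mixed", 1.0


# (tier name, model, minimum confidence to accept the answer)
TIERS: List[Tuple[str, Model, float]] = [
    ("embedding-classifier", embedding_classifier, 0.90),
    ("small-llm", small_llm, 0.80),
    ("frontier-llm", frontier_llm, 0.0),
]


def run_cascade(text: str, tiers: List[Tuple[str, Model, float]]) -> Tuple[str, str]:
    """Return (tier name, answer) from the first tier that is confident enough."""
    for name, model, min_confidence in tiers[:-1]:
        answer, confidence = model(text)
        if confidence >= min_confidence:
            return name, answer
    last_name, last_model, _ = tiers[-1]
    return last_name, last_model(text)[0]


for review in ["great phone", "arrived on time", "fine camera but weak battery"]:
    print(review, "->", run_cascade(review, TIERS))
```

The last tier has no threshold to clear. It is the backstop, so it always answers.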

Implications

What this changes

Embeddings become infrastructure. The 329x advantage depends entirely on precomputed embeddings. Without them, it drops to roughly 5x. This is the paper's clearest architectural lesson: compute embeddings at ingestion, store them alongside the data, and update them when the embedding model changes. Embeddings are data infrastructure, like an index.
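One way to picture "embeddings as an index" is a column that records which model produced each vector. The sketch below is a toy in-memory version under stated assumptions: `EmbeddingColumn`, `toy_embed`, and the lazy refresh-on-read policy are all illustrative choices, not something the paper prescribes. A production system might re-embed in a batch job instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredEmbedding:
    vector: List[float]
    model_version: str  # which embedding model produced the vector


class EmbeddingColumn:
    """Toy in-memory stand-in for an embedding column stored next to the text."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_version: str):
        self.embed_fn = embed_fn
        self.model_version = model_version
        self.texts: Dict[int, str] = {}
        self.embeddings: Dict[int, StoredEmbedding] = {}

    def ingest(self, row_id: int, text: str) -> None:
        """Compute the embedding once, at write time."""
        self.texts[row_id] = text
        self.embeddings[row_id] = StoredEmbedding(self.embed_fn(text), self.model_version)

    def switch_model(self, embed_fn: Callable[[str], List[float]], model_version: str) -> None:
        """Change the embedding model; stale rows are refreshed on next read."""
        self.embed_fn = embed_fn
        self.model_version = model_version

    def vector(self, row_id: int) -> List[float]:
        """Return the stored vector, re-embedding only if it is stale."""
        stored = self.embeddings[row_id]
        if stored.model_version != self.model_version:
            self.ingest(row_id, self.texts[row_id])
            stored = self.embeddings[row_id]
        return stored.vector


calls = {"count": 0}


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding function; a real one would call an embedding model."""
    calls["count"] += 1
    return [float(len(text)), float(text.count(" "))]


column = EmbeddingColumn(toy_embed, "v1")
column.ingest(1, "great phone")
column.ingest(2, "battery died in a week")
column.vector(1)  # served from storage, no embedding call
print("embedding calls so far:", calls["count"])
```

Reads never trigger an embedding call unless the model version has changed, which is the property the proxy's query-time speed depends on.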

The LLM becomes teacher and backstop. It generates training data for cheaper models and handles the cases they cannot. Both roles are high-value. Neither requires processing every input.

Quality evaluation becomes a runtime concern. The paper's automatic threshold mechanism (deploy proxy if relative accuracy >= 90%, fall back otherwise) makes routing a runtime decision, not a design-time one. Confidence scores must be meaningful, not just high. A model that is confidently wrong is worse than one that escalates.

Cost optimization becomes engineering. The LLM-for-everything approach has one cost lever: negotiate a lower per-token price. The adaptive approach has several: improve the embedding model, precompute more embeddings, add tiers, tune thresholds. Cost moves from procurement to architecture.

Scope. The 329x number requires precomputed embeddings, large tables, and classification-shaped tasks. Generalizing it to arbitrary LLM workloads is wrong. Proxy models work when the task is classification-shaped, embeddings capture the relevant signal, and the data distribution is stable. Outside those boundaries, the LLM is not optional.