Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes LLM Shepherding, a novel framework that addresses the high inference cost of large language models (LLMs) and the limited accuracy of small language models (SLMs) by introducing a token-level budget-aware collaboration mechanism. Unlike existing routing or cascading approaches that lack fine-grained cost control, LLM Shepherding employs a two-stage predictor to dynamically determine whether—and to what extent—to invoke an LLM for generating a short, adaptive prompt prefix, which then guides the SLM to complete the full response. Evaluated on benchmarks including GSM8K, CNK12, HumanEval, and MBPP, the method reduces inference costs by 42%–94% compared to pure LLM inference while maintaining comparable accuracy, achieving up to 2.8× higher cost efficiency than state-of-the-art baselines.

📝 Abstract
Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches, routing and cascading, treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8× cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
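The collaboration loop described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `predict_hint_budget`, `llm_generate`, and `slm_generate` are stand-in names for the two-stage predictor, the expensive LLM call, and the cheap SLM call, respectively.

```python
def shepherd(query, predict_hint_budget, llm_generate, slm_generate):
    """Answer `query` with an SLM, optionally guided by a short LLM hint prefix.

    predict_hint_budget(query) -> int: number of LLM hint tokens to buy;
        0 means the SLM answers alone (the routing special case), while a
        full-length budget degenerates to LLM-only inference (cascading's
        expensive arm).
    llm_generate(query, max_tokens) -> str: a short LLM prefix (the hint).
    slm_generate(query, prefix) -> str: SLM completion continuing `prefix`.
    """
    # Stage 1 + 2 of the predictor: whether to hint, and how many tokens.
    budget = predict_hint_budget(query)
    hint = llm_generate(query, max_tokens=budget) if budget > 0 else ""
    # The SLM completes the response conditioned on the hint prefix.
    return hint + slm_generate(query, prefix=hint)
```

The cost saving comes from `budget` typically being a small fraction (per the abstract, 10-30%) of the tokens a full LLM response would consume.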
Problem

Research questions and friction points this paper is trying to address.

cost-efficient inference
large language models
small language models
token-level budget control
LLM-SLM collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM Shepherding
cost-efficient inference
token-level budget control
small language models
hint-based prompting
Ziming Dong
Department of Computer Science, University of Victoria, Victoria, Canada
Hardik Sharma
Google
Deep Learning, Computer Architecture, Hardware Acceleration, Approximate Computing
Evan O'Toole
Department of Computer Science, University of Victoria, Victoria, Canada
J. Champati
Department of Computer Science, University of Victoria, Victoria, Canada
Kui Wu
Professor of Computer Science, University of Victoria
Sensor Networks, Network Tomography, Network Calculus, Computational Sustainability, Security and Intrusion Detection