In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) agents suffer from high inference costs, while fine-tuning or manual prompt engineering introduces significant development overhead. Method: This paper proposes a training-free, lightweight approach, in-context distillation combined with self-consistency cascades, adapting knowledge distillation to the in-context learning setting. At each agent step it retrieves relevant teacher demonstrations for a frozen student model and uses self-consistency checks to decide when the student can be trusted, preserving the productivity of working with frozen models. Contribution/Results: On ALFWorld, inference cost drops 2.5× ($0.059 → $0.024 per episode); on AppWorld, cost halves at iso-accuracy. At million-episode deployment scale, projected savings exceed $34,000. The method avoids the conventional reliance on fine-tuning or handcrafted prompts, offering a pathway toward efficient, scalable LLM agent deployment.

📝 Abstract
The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce *in-context distillation*, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of *self-consistency cascades* to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at **2.5× lower cost**, reducing per-episode costs from $0.059 to $0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding $34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a **2× cost reduction** at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
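The in-context distillation step the abstract describes can be sketched roughly as follows. All function names, the prompt format, and the token-overlap retrieval heuristic are illustrative assumptions, not details from the paper (which does not specify its retriever here):

```python
# Sketch: retrieve relevant teacher demonstrations for the current agent
# step and prepend them as in-context examples for a frozen student model.

def retrieve_demos(demo_bank, observation, k=3):
    """Pick the k teacher demos most similar to the current observation.
    Naive word-overlap similarity stands in for a real retriever."""
    def overlap(demo):
        return len(set(demo["obs"].split()) & set(observation.split()))
    return sorted(demo_bank, key=overlap, reverse=True)[:k]

def build_student_prompt(demos, observation):
    """Format retrieved (observation, action) teacher pairs as in-context
    examples, so the student can imitate teacher behavior on-the-fly."""
    lines = [f"Observation: {d['obs']}\nAction: {d['act']}" for d in demos]
    lines.append(f"Observation: {observation}\nAction:")
    return "\n\n".join(lines)

# Toy demonstration bank collected from teacher rollouts (made-up data).
demo_bank = [
    {"obs": "you see a fridge and a counter", "act": "open fridge"},
    {"obs": "you see a desk with a lamp", "act": "use lamp"},
    {"obs": "you see a cabinet above the sink", "act": "open cabinet"},
]

obs = "you see a fridge"
prompt = build_student_prompt(retrieve_demos(demo_bank, obs), obs)
```

The resulting `prompt` would be sent to the cheap student model; the teacher is only consulted when the cascade (below in the paper) distrusts the student.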
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM agent inference costs without fine-tuning
Enabling cost-effective imitation of teacher models via in-context distillation
Achieving accuracy with lower cost using self-consistency cascades
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context distillation adapts knowledge distillation to in-context learning
Self-consistency cascades determine when to trust the student model
Method reduces LLM agent costs without fine-tuning or prompt engineering
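The self-consistency cascade can be sketched as below. The sample count, agreement threshold, and majority-vote rule are assumptions for illustration; the paper's exact escalation criterion may differ:

```python
from collections import Counter

def cascade(student, teacher, prompt, n_samples=5, agree_threshold=0.8):
    """Sample the cheap student several times; if its answers mostly
    agree, trust the student, otherwise escalate to the costly teacher."""
    samples = [student(prompt) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agree_threshold:
        return answer, "student"          # consistent -> trust student
    return teacher(prompt), "teacher"     # inconsistent -> escalate
```

Cost savings come from the student handling the (majority of) steps where its samples agree, so the teacher is billed only on the hard, ambiguous steps.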
Vishnu Sarukkai
Stanford University
Asanshay Gupta
Stanford University
James Hong
Reve
Michael Gharbi
Reve
Kayvon Fatahalian
Associate Professor of Computer Science, Stanford University
Computer Graphics, Systems