In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) agents suffer from high inference costs, while fine-tuning or manual prompt engineering introduces significant development overhead. Method: This paper proposes a training-free, lightweight approach, in-context distillation combined with self-consistency cascades, adapting knowledge distillation to the in-context learning setting. At each agent step it retrieves relevant teacher demonstrations for a frozen student model and uses self-consistency checks to decide when the student can be trusted, preserving the productivity of working with frozen models. Contribution/Results: On ALFWorld, inference cost drops 2.5× ($0.059 → $0.024 per episode); on AppWorld, cost halves at iso-accuracy. At million-episode deployment scale, projected savings exceed $34,000. The method avoids the conventional reliance on fine-tuning or handcrafted prompts, offering a pathway toward efficient, scalable LLM agent deployment.

📝 Abstract
The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce *in-context distillation*, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of *self-consistency cascades* to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at **2.5× lower cost**, reducing per-episode costs from $0.059 to $0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding $34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a **2× cost reduction** at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
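The in-context distillation step the abstract describes can be sketched roughly as follows. All function names, the prompt format, and the token-overlap retrieval heuristic are illustrative assumptions, not details from the paper (which does not specify its retriever here):

```python
# Sketch: retrieve relevant teacher demonstrations for the current agent
# step and prepend them as in-context examples for a frozen student model.

def retrieve_demos(demo_bank, observation, k=3):
    """Pick the k teacher demos most similar to the current observation.
    Naive word-overlap similarity stands in for a real retriever."""
    def overlap(demo):
        return len(set(demo["obs"].split()) & set(observation.split()))
    return sorted(demo_bank, key=overlap, reverse=True)[:k]

def build_student_prompt(demos, observation):
    """Format retrieved (observation, action) teacher pairs as in-context
    examples, so the student can imitate teacher behavior on-the-fly."""
    lines = [f"Observation: {d['obs']}\nAction: {d['act']}" for d in demos]
    lines.append(f"Observation: {observation}\nAction:")
    return "\n\n".join(lines)

# Toy demonstration bank collected from teacher rollouts (made-up data).
demo_bank = [
    {"obs": "you see a fridge and a counter", "act": "open fridge"},
    {"obs": "you see a desk with a lamp", "act": "use lamp"},
    {"obs": "you see a cabinet above the sink", "act": "open cabinet"},
]

obs = "you see a fridge"
prompt = build_student_prompt(retrieve_demos(demo_bank, obs), obs)
```

The resulting `prompt` would be sent to the cheap student model; the teacher is only consulted when the cascade (below in the paper) distrusts the student.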
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM agent inference costs without fine-tuning
Enabling cost-effective imitation of teacher models via in-context distillation
Achieving accuracy with lower cost using self-consistency cascades
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context distillation adapts knowledge distillation to in-context learning
Self-consistency cascades determine when to trust the student model
Method reduces LLM agent costs without fine-tuning or prompt engineering
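The self-consistency cascade can be sketched as below. The sample count, agreement threshold, and majority-vote rule are assumptions for illustration; the paper's exact escalation criterion may differ:

```python
from collections import Counter

def cascade(student, teacher, prompt, n_samples=5, agree_threshold=0.8):
    """Sample the cheap student several times; if its answers mostly
    agree, trust the student, otherwise escalate to the costly teacher."""
    samples = [student(prompt) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agree_threshold:
        return answer, "student"          # consistent -> trust student
    return teacher(prompt), "teacher"     # inconsistent -> escalate
```

Cost savings come from the student handling the (majority of) steps where its samples agree, so the teacher is billed only on the hard, ambiguous steps.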
Vishnu Sarukkai
Stanford University
Asanshay Gupta
Stanford University
James Hong
Reve
Michael Gharbi
Reve
Kayvon Fatahalian
Associate Professor of Computer Science, Stanford University
Computer Graphics, Systems