Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of large language models (LLMs) on complex reasoning tasks, where long step-by-step generations make it hard to balance efficiency and performance. The authors propose Tandem, a framework in which an LLM acts as a strategic coordinator that produces a compact set of key reasoning insights, which then guide a small language model (SLM) through the full reasoning process and the final answer. A cost-aware termination mechanism stops the LLM's generation early once enough guidance has accumulated, and the sufficiency classifier, trained on one domain, transfers to others without retraining. On mathematical reasoning and code generation benchmarks, Tandem reduces computational cost by approximately 40% compared with standalone LLM reasoning while achieving competitive or superior performance.

📝 Abstract

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.
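
The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`tandem_answer`, `llm_next_insight`, `sufficiency_score`, `slm_solve`), the `threshold` and `max_insights` parameters, and the toy stand-in models are all assumptions made for this example. In the real system the three callables would wrap an LLM, a trained sufficiency classifier, and an SLM; the repository linked above is the place to check how Tandem actually does this.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TandemResult:
    answer: str
    insights: List[str]
    llm_steps: int  # number of insights the large model produced


def tandem_answer(
    question: str,
    llm_next_insight: Callable[[str, List[str]], Optional[str]],
    sufficiency_score: Callable[[str, List[str]], float],
    slm_solve: Callable[[str, List[str]], str],
    threshold: float = 0.7,   # hypothetical stopping threshold
    max_insights: int = 8,    # hypothetical hard cap on LLM guidance
) -> TandemResult:
    """Collect LLM insights until they look sufficient, then let the SLM finish."""
    insights: List[str] = []
    for _ in range(max_insights):
        insight = llm_next_insight(question, insights)
        if insight is None:  # the large model has nothing more to add
            break
        insights.append(insight)
        # Early stop: end the expensive LLM generation once guidance suffices.
        if sufficiency_score(question, insights) >= threshold:
            break
    # The small model carries out the full reasoning using the guidance.
    answer = slm_solve(question, insights)
    return TandemResult(answer=answer, insights=insights, llm_steps=len(insights))


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_llm(question: str, insights: List[str]) -> Optional[str]:
    plan = [
        "Rewrite 12 * 15 as 12 * (10 + 5)",
        "Distribute: 12 * 10 + 12 * 5",
        "Add the two partial products",
    ]
    return plan[len(insights)] if len(insights) < len(plan) else None


def toy_score(question: str, insights: List[str]) -> float:
    # Placeholder for a learned classifier: confidence grows with guidance.
    return min(1.0, 0.4 * len(insights))


def toy_slm(question: str, insights: List[str]) -> str:
    # Placeholder for the small model's full reasoning and final answer.
    return "180"


result = tandem_answer("What is 12 * 15?", toy_llm, toy_score, toy_slm)
print(result.llm_steps, result.answer)
```

In this toy run the loop stops after two of the three available insights, because the placeholder score crosses the assumed threshold at that point. The saving claimed in the abstract comes from the same idea: the large model's output is cut short and the cheaper model generates the long reasoning trace.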

Problem

Research questions and friction points this paper is trying to address.

reasoning-intensive inference

computational overhead

large language models

efficient reasoning

model collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tandem

large language models

small language models