When Less is Enough: Efficient Inference via Collaborative Reasoning

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high computational cost of end-to-end reasoning with a single large language model, which often struggles to balance performance and efficiency. The authors propose DUET, a dual-model collaborative inference framework: in the first stage, a large model generates a concise reasoning signal; in the second stage, a lightweight model produces the final answer from that signal. A joint training objective with a length penalty encourages the large model to transmit only the minimal information needed to solve the task, which keeps the collaboration efficient. Evaluated on challenging reasoning benchmarks such as AIME and GPQA, DUET maintains strong performance while reducing the number of output tokens from the large model by up to 60%.
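The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TextModel` type, `two_stage_infer`, the prompt template, the whitespace token count, and the toy stub models are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: each "model" is any callable mapping prompt text to output text.
TextModel = Callable[[str], str]


@dataclass
class DuetResult:
    signal: str              # reasoning signal emitted by the large model
    answer: str              # final answer produced by the lightweight model
    large_model_tokens: int  # output tokens spent by the large model


def count_tokens(text: str) -> int:
    # Crude whitespace split standing in for a real tokenizer (assumption).
    return len(text.split())


def two_stage_infer(question: str, large: TextModel, small: TextModel) -> DuetResult:
    # Stage 1: the large model produces a concise reasoning signal.
    signal = large(question)
    # Stage 2: the lightweight model answers from the question plus the signal.
    # The prompt layout below is illustrative only.
    prompt = f"Question: {question}\nSignal: {signal}\nAnswer:"
    answer = small(prompt)
    return DuetResult(signal=signal, answer=answer,
                      large_model_tokens=count_tokens(signal))


# Toy stand-ins so the sketch runs without any model weights.
def toy_large(question: str) -> str:
    return "add the two numbers"


def toy_small(prompt: str) -> str:
    return "5"


result = two_stage_infer("What is 2 + 3?", toy_large, toy_small)
print(result)
```

In a real setup the two callables would wrap actual model calls, and only the large model's output tokens would count toward the cost that DUET aims to reduce.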
📝 Abstract
In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to perform end-to-end reasoning and prediction often incurs substantial inference cost. In contrast, DUET decomposes inference into two stages: the capable model produces a reasoning signal, and the lightweight model interprets this signal to generate the final answer. Reasoning-intensive computation is thus handled by the capable model, while non-reasoning-intensive components are delegated to the lightweight model without sacrificing task performance. To this end, we propose a length-penalized joint training objective that encourages the capable model to transmit only the information that is sufficient for the lightweight model to solve the task. As a result, DUET maintains strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model's output tokens on challenging reasoning benchmarks, including AIME and GPQA.
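The abstract does not spell out the form of the length-penalized objective. One simple way to picture the idea, under the assumption that it can be read as a per-example score of task success minus a cost proportional to signal length, is the sketch below. The function name `length_penalized_reward` and the weight `penalty` are hypothetical and are not taken from the paper.

```python
def length_penalized_reward(correct: bool, signal_tokens: int,
                            penalty: float = 0.001) -> float:
    """Illustrative score: task success minus a cost per signal token.

    Assumption: a linear penalty with weight `penalty`; the paper's actual
    objective may take a different form.
    """
    task_reward = 1.0 if correct else 0.0
    return task_reward - penalty * signal_tokens


# Two signals that both lead the lightweight model to the right answer:
short_signal_score = length_penalized_reward(correct=True, signal_tokens=100)
long_signal_score = length_penalized_reward(correct=True, signal_tokens=800)

# A short signal that fails to carry enough information:
failed_signal_score = length_penalized_reward(correct=False, signal_tokens=20)

print(short_signal_score, long_signal_score, failed_signal_score)
```

Under this reading, a shorter signal scores higher only as long as the lightweight model still answers correctly, which matches the stated goal of transmitting information that is sufficient and no more.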
Problem

Research questions and friction points this paper is trying to address.

efficient inference
collaborative reasoning
reasoning cost
large language models
inference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative reasoning
efficient inference
two-stage inference
length-penalized training
model collaboration