When Less is Enough: Efficient Inference via Collaborative Reasoning

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of end-to-end reasoning with a single large language model, an approach that often struggles to balance performance and efficiency. The authors propose DUET, a dual-model collaborative inference framework: in the first stage, a large model generates a concise reasoning signal; in the second stage, a lightweight model produces the final answer based on this signal. A joint training objective with a length penalty encourages the large model to transmit only the minimal information necessary for solving the task, thereby enabling efficient collaboration. Evaluated on challenging reasoning benchmarks such as AIME and GPQA, DUET maintains strong performance while reducing the number of output tokens from the large model by up to 60%.

📝 Abstract

In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to perform end-to-end reasoning and prediction often incurs substantial inference cost. In contrast, DUET decomposes inference into two stages: the capable model produces a reasoning signal, and the lightweight model interprets this signal to generate the final answer. Reasoning-intensive computation is thus handled by the capable model, while non-reasoning-intensive components are delegated to the lightweight model without sacrificing task performance. To achieve this, we propose a length-penalized joint training objective that encourages the capable model to transmit only the information that is sufficient for the lightweight model to solve the task. As a result, DUET maintains strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model's output tokens on challenging reasoning benchmarks, including AIME and GPQA.
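
The two-stage flow and the length penalty described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `TextModel` interface, the prompt template, the whitespace token count, the linear penalty with weight `lam`, and the toy stand-in models are all hypothetical, and the abstract does not specify the actual form of the training objective.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: a function from prompt text to output text.
TextModel = Callable[[str], str]


@dataclass
class DuetResult:
    signal: str         # reasoning signal produced by the capable model
    answer: str         # final answer produced by the lightweight model
    signal_tokens: int  # length of the signal (crude whitespace count)


def count_tokens(text: str) -> int:
    """Whitespace token count; a real system would use the model's tokenizer."""
    return len(text.split())


def duet_infer(question: str, capable: TextModel, light: TextModel) -> DuetResult:
    """Stage 1: the capable model emits a signal. Stage 2: the lightweight model answers."""
    signal = capable(question)
    # Assumed prompt template; the paper's actual format is not given in the abstract.
    answer = light(f"Question: {question}\nHint: {signal}\nAnswer:")
    return DuetResult(signal, answer, count_tokens(signal))


def length_penalized_reward(correct: bool, signal_tokens: int, lam: float = 0.01) -> float:
    """Assumed reward shape: task success minus a linear penalty on signal length."""
    return (1.0 if correct else 0.0) - lam * signal_tokens


# Toy stand-ins for the two models, used only to make the sketch runnable.
def toy_capable(question: str) -> str:
    return "add the two numbers"


def toy_light(prompt: str) -> str:
    return "5"


result = duet_infer("What is 2 + 3?", toy_capable, toy_light)
reward = length_penalized_reward(result.answer == "5", result.signal_tokens)
```

Under this assumed reward, a shorter signal that still lets the lightweight model answer correctly scores higher than a longer one, which is the pressure the abstract attributes to the length-penalized objective.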

Problem

Research questions and friction points this paper is trying to address.

efficient inference

collaborative reasoning

reasoning cost

large language models

inference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative reasoning

efficient inference

two-stage inference