Hawkeye: Efficient Reasoning with Model Collaboration

📅 2025-04-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from redundant intermediate reasoning steps, low semantic density, and high computational overhead in chain-of-thought (CoT) inference. To address these issues, we propose a collaborative “instruction refinement–response execution” paradigm: an LLM first generates compact, high-information-density reasoning instructions, which are then executed efficiently by a smaller model to produce the final response. We introduce a reinforcement learning–based framework for quantifying CoT redundancy and distilling high-density information, jointly optimized during both post-training and inference. Experiments demonstrate that our method retains equivalent response quality using only 35% of the original CoT tokens, improves clarity, coherence, and conciseness by approximately 10%, achieves a 3.4× end-to-end speedup on complex mathematical tasks, and reduces inference cost by 60%.
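
Below is a minimal sketch of the "instruction refinement → response execution" split described above, written with Hugging Face pipelines. The model names, prompt templates, and token budgets are illustrative assumptions, not the authors' released configuration.

```python
# Hedged sketch of the "instruction refinement -> response execution" paradigm.
# Model names, prompts, and token budgets below are illustrative assumptions.
from transformers import pipeline

# Large model: produces a compact, high-density reasoning instruction.
refiner = pipeline("text-generation", model="Qwen/Qwen2.5-14B-Instruct")
# Small model: expands the concise instruction into the final response.
executor = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

def answer_with_concise_cot(question: str) -> str:
    # Stage 1: ask the large model for a short plan, not a full solution.
    plan = refiner(
        f"Problem: {question}\n"
        "Write a concise step-by-step plan (a few short steps). "
        "Do not solve the problem fully.",
        max_new_tokens=128,
        return_full_text=False,
    )[0]["generated_text"]

    # Stage 2: the small model executes the plan to produce the final answer.
    return executor(
        f"Problem: {question}\nPlan:\n{plan}\n"
        "Follow the plan and give the final answer.",
        max_new_tokens=256,
        return_full_text=False,
    )[0]["generated_text"]
```

In this split, only a short plan (rather than the full chain-of-thought) is passed to the small, cheap executor, which is where the reported token and latency savings come from.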

Technology Category
Machine Learning Systems

Application Category
Large Language Model

📝 Abstract
Chain-of-Thought (CoT) reasoning has demonstrated remarkable effectiveness in enhancing the reasoning abilities of large language models (LLMs). However, its efficiency remains a challenge due to the generation of excessive intermediate reasoning tokens, which introduce semantic redundancy and overly detailed reasoning steps. Moreover, computational expense and latency are significant concerns, as the cost scales with the number of output tokens, including those intermediate steps. In this work, we observe that most CoT tokens are unnecessary, and retaining only a small portion of them is sufficient for producing high-quality responses. Inspired by this, we propose HAWKEYE, a novel post-training and inference framework where a large model produces concise CoT instructions to guide a smaller model in response generation. HAWKEYE quantifies redundancy in CoT reasoning and distills high-density information via reinforcement learning. By leveraging these concise CoTs, HAWKEYE is able to expand responses while reducing token usage and computational cost significantly. Our evaluation shows that HAWKEYE can achieve comparable response quality using only 35% of the full CoTs, while improving clarity, coherence, and conciseness by approximately 10%. Furthermore, HAWKEYE can accelerate end-to-end reasoning by up to 3.4x on complex math tasks while reducing inference cost by up to 60%. HAWKEYE will be open-sourced and the models will be available soon.
Problem

Research questions and friction points this paper is trying to address.

Redundant intermediate tokens in Chain-of-Thought reasoning
High computational cost and latency of LLM inference
How to retain response quality while using far fewer CoT tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large model generates concise CoT instructions
Reinforcement learning distills high-density information (a reward sketch follows this list)
Reduces token usage and computational cost significantly
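
A hedged sketch of one plausible reward shape for the distillation step: answer quality traded off against the fraction of the full CoT that is kept. The quality_score input and the alpha/beta weights are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch: score a candidate concise CoT by answer quality minus a
# compression penalty. Weights and the quality signal are illustrative assumptions.
def concise_cot_reward(quality_score: float,
                       cot_tokens: int,
                       full_cot_tokens: int,
                       alpha: float = 1.0,
                       beta: float = 0.5) -> float:
    """Higher reward for high-quality answers produced from shorter CoTs."""
    kept_fraction = cot_tokens / max(full_cot_tokens, 1)  # e.g. 0.35 for 35% of tokens
    return alpha * quality_score - beta * kept_fraction

# Example: a high-quality answer produced from 35% of the original CoT scores well.
print(concise_cot_reward(quality_score=0.9, cot_tokens=350, full_cot_tokens=1000))
```
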
Jianshu She
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Zhuohao Li
University of California, Los Angeles
Zhemin Huang
Stanford University
Qi Li
Independent Researcher
Peiran Xu
University of California, Los Angeles
Haonan Li
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Qirong Ho
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Petuum, Inc.