🤖 AI Summary
Existing agent collaboration frameworks obscure what actually drives their gains on complex reasoning tasks, and they often depend heavily on external orchestration. This work addresses that gap by framing “heavy thinking” as a learnable, endogenous skill and by introducing HeavySkill, a two-stage pipeline of parallel reasoning followed by summarization that can operate beneath any agentic framework. The skill is trained with reinforcement learning so that it scales and becomes internalized in the model's parameters. Experiments across multiple domains show that the approach consistently outperforms Best-of-N baselines, that stronger base models can approach Pass@N performance, and that both the depth and width of heavy thinking can be scaled further through training.
📝 Abstract
Recent advances in agentic harnesses, with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use, have achieved remarkable success on complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in an orchestration harness but also as an inner skill, internalized within the model's parameters, that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, i.e., parallel reasoning followed by summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
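To make the two-stage pipeline and its two comparison points concrete, the following is a minimal sketch, not the authors' implementation. All function names, signatures, and the toy stand-ins are assumptions introduced for illustration. In particular, majority voting is used only as a placeholder for the summarization stage, which the abstract describes as part of the model's own skill, and the Best-of-N scorer is a hypothetical external ranking function.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical signatures: none of these names come from the paper.
Reasoner = Callable[[str, int], str]              # (question, sample index) -> candidate answer
Summarizer = Callable[[str, Sequence[str]], str]  # (question, candidates) -> final answer


def parallel_reason(question: str, reason: Reasoner, width: int) -> List[str]:
    """Stage 1: draw `width` independent reasoning samples."""
    return [reason(question, i) for i in range(width)]


def heavy_think(question: str, reason: Reasoner, summarize: Summarizer, width: int = 4) -> str:
    """Stage 1 then stage 2: summarize the parallel samples into one answer."""
    return summarize(question, parallel_reason(question, reason, width))


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: keep the single candidate that an external scorer ranks highest."""
    return max(candidates, key=score)


def pass_at_n(candidates: Sequence[str], reference: str) -> bool:
    """Pass@N: true if any candidate matches the reference answer."""
    return any(c == reference for c in candidates)


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_reason(question: str, i: int) -> str:
    return ["42", "41", "42", "17"][i % 4]


def toy_summarize(question: str, candidates: Sequence[str]) -> str:
    # Majority vote is only a placeholder for a model-driven summarization step.
    return Counter(candidates).most_common(1)[0][0]


if __name__ == "__main__":
    samples = parallel_reason("toy question", toy_reason, width=4)
    print("samples:", samples)
    print("heavy_think:", heavy_think("toy question", toy_reason, toy_summarize, width=4))
    print("pass@4:", pass_at_n(samples, "42"))
```

In this framing, `width` corresponds to the number of parallel samples. Best-of-N selects one existing candidate, whereas the summarization stage sees all candidates at once, and Pass@N only asks whether a correct candidate exists among them.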