🤖 AI Summary
This work addresses the challenge of lazy reasoning in large reasoning models when handling complex tool-use tasks, a failure mode that often stems from an inability to decompose tasks into manageable subtasks. To overcome this limitation, the authors propose D-CORE, a two-stage training framework that first employs self-distillation to elicit intrinsic task-decomposition capabilities and then applies diversity-aware reinforcement learning to restore and enhance reflective reasoning. This study is the first to integrate explicit task decomposition with a compositional reasoning mechanism into the training of large reasoning models, significantly improving generalization. Experimental results demonstrate that D-CORE-8B achieves 77.7% accuracy on BFCLv3, outperforming the previous best 8B model by 5.7%, while D-CORE-14B sets a new state of the art with 79.3% accuracy, surpassing even 70B-scale models.
📝 Abstract
Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool-use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework, D-CORE (**D**ecomposing tasks and **Co**mposing **Re**asoning processes), that first incentivizes the LRMs' task-decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore the LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5× smaller. The source code is available at https://github.com/alibaba/EfficientAI.
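The two-stage pipeline named in the abstract can be sketched in miniature. The following toy Python code is only an illustrative assumption of the described structure: stage 1 keeps the model's own valid task-decomposition traces for fine-tuning (self-distillation), and stage 2 scores groups of rollouts with a simple diversity bonus as a stand-in for diversity-aware RL. All function names, the dict "model", and the reward formula are hypothetical, not the authors' implementation.

```python
def decompose(task: str) -> list[str]:
    """Stand-in for the model proposing sub-tasks (here: split on ';')."""
    return [s.strip() for s in task.split(";") if s.strip()]

def self_distill(model: dict, tasks: list[str]) -> dict:
    """Stage 1 (sketch): collect the model's own decomposition traces,
    keep only the valid ones, and treat them as fine-tuning data."""
    traces = [(t, decompose(t)) for t in tasks]
    model["sft_data"] = [(t, subs) for t, subs in traces if subs]
    return model

def diversity_reward(rollouts: list[str]) -> float:
    """Stage 2 (sketch): a toy proxy for a diversity-aware reward that
    favors groups of distinct reasoning rollouts over collapsed ones."""
    return len(set(rollouts)) / max(len(rollouts), 1)

def dcore_train(model: dict, tasks: list[str], rollouts: list[str]) -> dict:
    model = self_distill(model, tasks)          # stage 1: elicit decomposition
    model["rl_reward"] = diversity_reward(rollouts)  # stage 2: diversity-aware RL signal
    return model

model = dcore_train({}, ["fetch data; parse; summarize"],
                    ["plan A", "plan B", "plan A"])
```

Here the single task yields one three-sub-task trace for stage 1, and the rollout group (two distinct plans out of three) earns a diversity reward of 2/3; a real system would replace both stubs with actual model sampling and policy-gradient updates.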