🤖 AI Summary
Existing reinforcement learning approaches for improving reasoning in large language models suffer from low sample efficiency and high training cost. Difficulty-aware data selection does not fully fix this: its difficulty estimates degrade as the policy drifts, it yields limited final-performance gains, and it leaves inference efficiency largely unchanged. This work proposes DARE, a unified framework that co-evolves difficulty estimation with policy training via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tier-specific training strategies with adaptive compute allocation. Across multiple models and domains, DARE consistently outperforms existing methods in training efficiency, final performance, and inference efficiency, producing more concise responses on easy tasks and more accurate ones on hard tasks.
📝 Abstract
Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose **Dare**, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that **Dare** consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at https://github.com/EtaYang10th/DARE.
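The two sampling ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the generic techniques under stated assumptions. It assumes that difficulty is tracked as a per-prompt success rate, that rollouts collected under an older policy are reweighted by the ratio of new to old sequence probabilities, and that prompts are drawn in proportion to an unnormalized symmetric Beta(alpha, alpha) density over that success rate. All function names, the `alpha` and `eps` parameters, and the clipping step are hypothetical.

```python
import math
import random


def snis_success_rate(rewards, logp_new, logp_old):
    """Self-normalized importance sampling estimate of a prompt's success rate.

    rewards:  0/1 outcomes of rollouts sampled from an OLDER policy.
    logp_new: log-probability of each rollout under the CURRENT policy.
    logp_old: log-probability of each rollout under the older policy.
    (Assumed inputs; the paper's exact estimator may differ.)
    """
    log_w = [n - o for n, o in zip(logp_new, logp_old)]
    shift = max(log_w)  # subtract the max before exponentiating for stability
    w = [math.exp(lw - shift) for lw in log_w]
    # The shift cancels in the ratio, so the estimate is unchanged by it.
    return sum(wi * ri for wi, ri in zip(w, rewards)) / sum(w)


def symmetric_beta_weight(p, alpha=2.0, eps=1e-3):
    """Unnormalized Beta(alpha, alpha) density at success rate p.

    Clipping p to [eps, 1 - eps] (an assumption of this sketch) keeps the
    weight of very easy and very hard prompts strictly positive when alpha > 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    return (p * (1.0 - p)) ** (alpha - 1.0)


def sample_prompts(success_rates, k, alpha=2.0, rng=None):
    """Draw k prompt ids (with replacement) weighted by the symmetric Beta density."""
    rng = rng or random.Random()
    ids = list(success_rates)
    weights = [symmetric_beta_weight(success_rates[i], alpha) for i in ids]
    return rng.choices(ids, weights=weights, k=k)
```

With `alpha > 1` the weight peaks at a success rate of 0.5 and is symmetric around it, so moderately difficult prompts are favored while easy and hard prompts still get sampled. With `alpha = 1` the weights become uniform.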