🤖 AI Summary
This work investigates the significant performance disparities observed among large language models under identical reinforcement learning (RL) training, noting that some model families struggle to benefit from such optimization. The study introduces, for the first time, "distributional clarity" as a key structural property governing RL compatibility and quantifies it using the Silhouette Coefficient. Building on this insight, the authors propose a Silhouette-Aware Reweighting strategy that adaptively upweights low-clarity samples during training to improve learning efficiency. Extensive experiments across six mathematical reasoning benchmarks demonstrate consistent performance gains, with improvements of up to 5.9 points on AIME24, thereby validating both the trainability and broad applicability of distributional clarity as a guiding principle for RL-based model refinement.
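The paper does not spell out its exact reweighting rule here, but the idea of prioritizing low-clarity samples can be sketched with a hypothetical monotone weighting: samples whose silhouette score $S_i$ is low receive a larger loss weight. The function name, the linear form `1 + alpha * (1 - S_i)`, and the `alpha` parameter are our illustrative assumptions, not the authors' formula.

```python
def silhouette_aware_weights(sample_scores, alpha=1.0):
    """Illustrative (not the paper's exact) reweighting rule.

    sample_scores: per-sample silhouette coefficients S_i in [-1, 1].
    Returns loss weights that grow as S_i shrinks, so low-clarity
    samples are prioritized during RL training.
    """
    weights = []
    for s in sample_scores:
        s = max(-1.0, min(1.0, s))        # clip to the valid silhouette range
        weights.append(1.0 + alpha * (1.0 - s))
    return weights

# Low-S samples get the largest weights:
print(silhouette_aware_weights([1.0, 0.0, -1.0]))  # → [1.0, 2.0, 3.0]
```

In practice such weights would multiply each sample's policy-gradient loss term, shifting optimization effort toward the samples whose correct/incorrect probability clusters are least separated.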
📝 Abstract
Language model families exhibit striking disparities in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: **distributional clarity** in probability space. Through a three-stage analysis, from phenomenon to mechanism to interpretation, we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in the probabilities they assign to correct vs. incorrect responses. We quantify this clarity using the **Silhouette Coefficient** ($S$) and demonstrate that (1) high $S$ correlates strongly with RL performance, and (2) low $S$ is associated with severe logic errors and reasoning instability. To confirm that this property is trainable, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-$S$ samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains of up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-friendliness.
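The Silhouette Coefficient the abstract refers to can be illustrated on the two clusters it describes: probabilities a model assigns to correct vs. incorrect responses. A minimal 1-D sketch (the function name and toy data are ours; the paper's exact feature space is not given here): for each point, $a$ is the mean distance to its own cluster, $b$ the mean distance to the other cluster, and $s = (b - a)/\max(a, b)$; $S$ is the mean over points.

```python
from statistics import mean

def silhouette(values, labels):
    """Mean silhouette coefficient for 1-D points with binary labels.

    values: per-response probabilities assigned by the model.
    labels: 1 for correct responses, 0 for incorrect ones.
    """
    scores = []
    for i, (v, lab) in enumerate(zip(values, labels)):
        same = [abs(v - w) for j, (w, l) in enumerate(zip(values, labels))
                if l == lab and j != i]
        other = [abs(v - w) for w, l in zip(values, labels) if l != lab]
        if not same or not other:
            continue
        a, b = mean(same), mean(other)       # intra- vs. inter-cluster distance
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Compact, well-separated clusters -> S near 1 (the RL-friendly pattern)
probs  = [0.90, 0.88, 0.92, 0.15, 0.10, 0.12]
labels = [1, 1, 1, 0, 0, 0]
print(round(silhouette(probs, labels), 3))  # → 0.961
```

Overlapping clusters (e.g. correct and incorrect responses both assigned probabilities near 0.5) drive $S$ toward 0 or below, which is the low-clarity regime the reweighting strategy targets.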