Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the significant performance disparities observed among large language models under identical reinforcement learning (RL) training, noting that some models struggle to benefit from such optimization. The study introduces "distributional clarity" as a key structural property governing RL compatibility and quantifies it using the Silhouette Coefficient. Building on this insight, the authors propose a Silhouette-Aware Reweighting strategy that adaptively upweights low-Silhouette samples during training to improve learning efficiency. Extensive experiments across six mathematical reasoning benchmarks demonstrate consistent performance gains, with improvements of up to 5.9 points on AIME24, thereby validating both the trainability and broad applicability of distributional clarity as a guiding principle for RL-based model refinement.

📝 Abstract
Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis, from phenomenon to mechanism to interpretation, we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient (S) and demonstrate that (1) high S correlates strongly with RL performance; (2) low S is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-friendliness.
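The Silhouette Coefficient the abstract refers to is the standard clustering metric, applied here to two clusters of scalar probabilities (those assigned to correct vs. incorrect responses). A minimal sketch of that computation, assuming 1-D probabilities and absolute-difference distance; the function name and toy data are illustrative, not taken from the paper:

```python
def silhouette(correct, incorrect):
    """Mean silhouette score over two clusters of scalar probabilities.

    For each point: a = mean intra-cluster distance, b = mean distance
    to the other cluster, s = (b - a) / max(a, b). High S means the
    model's probabilities for correct vs. incorrect answers separate
    cleanly -- "distributional clarity" in the paper's terms.
    """
    scores = []
    for own, other in ((correct, incorrect), (incorrect, correct)):
        for i, p in enumerate(own):
            same = own[:i] + own[i + 1:]
            if not same:  # singleton cluster: s is 0 by convention
                scores.append(0.0)
                continue
            a = sum(abs(p - q) for q in same) / len(same)    # intra-class
            b = sum(abs(p - q) for q in other) / len(other)  # inter-class
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# A model with clear separation scores near 1:
clear = silhouette([0.90, 0.92, 0.88], [0.10, 0.12, 0.15])
# Overlapping probability assignments score near (or below) 0:
muddled = silhouette([0.55, 0.60, 0.48], [0.50, 0.52, 0.58])
```

Under the paper's reweighting strategy, samples with low per-point silhouette would be prioritized during RL training; the exact weighting scheme is not specified on this page.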
Problem

Research questions and friction points this paper is trying to address.

RL-friendliness, distributional clarity, large language models, reinforcement learning, probability distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional clarity, reinforcement learning, Silhouette Coefficient, RL-friendliness, probability distribution
Authors

Shaoning Sun (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Mingzhu Cai (Baidu Inc.)
H. He (Baidu Inc.)
Bingjin Chen (Baidu Inc.)
Siqi Bao (Baidu Inc.) — Natural Language Processing, Medical Image Analysis
Yujiu Yang (SIGS, Tsinghua University) — Machine Learning, Natural Language Processing, Computer Vision
Hua Wu (Baidu Inc.)
Haifeng Wang (Baidu Inc.) — NLP, MT, Search, Speech, Data Mining