Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address gradient signal dilution and slow convergence in critic-free policy optimization—caused by redundant samples and tokens—this paper proposes D³S, a dynamic dual-level downsampling framework. At the sample level, D³S selects trajectory subsets with maximal advantage variance; at the token level, it filters high-impact, uncertain tokens using the absolute product of advantage and policy entropy (|A × H|), augmented by a curriculum-learning-inspired dynamic scheduling mechanism. D³S is the first method to achieve joint, adaptive pruning of samples and tokens within the critic-free paradigm. We theoretically establish that its sample selection criterion is positively correlated with the upper bound of the policy gradient estimator. Experiments on Qwen2.5 and Llama3.1 demonstrate that D³S significantly improves training efficiency and generalization across multiple reasoning benchmarks, while consuming fewer samples and tokens.

📝 Abstract
Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly, as critical learning signals are diluted by an abundance of uninformative samples and tokens. To tackle this challenge, we propose the \textbf{Dynamic Dual-Level Down-Sampling (D$^3$S)} framework that prioritizes the most informative samples and tokens across groups to improve the efficiency of policy optimization. D$^3$S operates along two levels: (1) the sample level, which selects a subset of rollouts to maximize advantage variance ($\text{Var}(A)$). We theoretically prove that this selection is positively correlated with the upper bound of the policy gradient norms, yielding higher policy gradients. (2) the token level, which prioritizes tokens with a high product of advantage magnitude and policy entropy ($|A_{i,t}| \times H_{i,t}$), focusing updates on tokens where the policy is both uncertain and impactful. Moreover, to prevent overfitting to high-signal data, D$^3$S employs a dynamic down-sampling schedule inspired by curriculum learning. This schedule starts with aggressive down-sampling to accelerate early learning and gradually relaxes to promote robust generalization. Extensive experiments on Qwen2.5 and Llama3.1 demonstrate that integrating D$^3$S into advanced RL algorithms achieves state-of-the-art performance and generalization while requiring \textit{fewer} samples and tokens across diverse reasoning benchmarks. Our code is included in the supplementary materials and will be made publicly available.
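The two selection criteria in the abstract can be sketched in code. The snippet below is a hypothetical illustration, not the authors' implementation: it greedily approximates the maximal-$\text{Var}(A)$ rollout subset by ranking rollouts by how far their mean advantage lies from the group mean, keeps tokens with the highest $|A_{i,t}| \times H_{i,t}$ scores, and relaxes the keep ratio over training as the curriculum-style schedule describes. The function name, the greedy variance proxy, and the linear schedule are all assumptions.

```python
import numpy as np

def d3s_downsample(advantages, entropies, step, total_steps, min_keep=0.25):
    """Hypothetical sketch of D^3S's dual-level down-sampling.

    advantages: (n_rollouts, seq_len) per-token advantage estimates
    entropies:  (n_rollouts, seq_len) per-token policy entropies
    Returns boolean masks over rollouts and over tokens.
    """
    # Curriculum-style schedule: aggressive down-sampling early
    # (keep_ratio == min_keep), relaxing linearly toward keeping everything.
    keep_ratio = min_keep + (1.0 - min_keep) * (step / total_steps)

    # Sample level: greedy proxy for the subset maximizing Var(A).
    # Rollouts whose mean advantage is far from the group mean
    # contribute most to advantage variance.
    mean_adv = advantages.mean(axis=1)
    n_keep = max(2, int(round(keep_ratio * len(mean_adv))))
    order = np.argsort(-np.abs(mean_adv - mean_adv.mean()))
    sample_mask = np.zeros(len(mean_adv), dtype=bool)
    sample_mask[order[:n_keep]] = True

    # Token level: among kept rollouts, retain tokens scoring highest
    # on |A| * H (impactful and uncertain).
    scores = np.abs(advantages) * entropies
    threshold = np.quantile(scores[sample_mask], 1.0 - keep_ratio)
    token_mask = sample_mask[:, None] & (scores >= threshold)
    return sample_mask, token_mask
```

In a GRPO-style loop, the returned token mask would simply gate which per-token policy-gradient terms enter the loss, so pruned samples and tokens cost no backward computation.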
Problem

Research questions and friction points this paper is trying to address.

Mitigates slow convergence in critic-free policy optimization methods
Selects the most informative samples and tokens to improve learning efficiency
Prevents overfitting via a dynamic down-sampling schedule
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic dual-level down-sampling framework for policy optimization
Sample-level selection maximizes advantage variance to strengthen policy gradients
Token-level selection prioritizes tokens with high advantage magnitude and policy entropy
Chao Wang
Tsinghua University
Tao Yang
WeChat, Tencent
Hongtao Tian
WeChat, Tencent
Yunsheng Shi
WeChat, Tencent
Qiyao Ma
University of California, Davis
Xiaotao Liu
WeChat, Tencent
Ting Yao
WeChat, Tencent
Wenbo Ding
University at Buffalo