AI Summary
This work addresses the challenge in open-ended question answering where existing reinforcement learning approaches employ fixed weights for positive and negative samples, struggling to balance response diversity with training stability. The authors propose an entropy-driven adaptive weighting strategy that distinguishes positive from negative samples based on reward means and dynamically adjusts the weight of positive samples according to policy entropy, reducing the weight when entropy decreases to sustain exploration and increasing it when entropy rises to accelerate convergence. This approach reveals, for the first time, that negative samples primarily govern diversity and performance ceilings, while positive samples dictate generation quality and training stability, thereby effectively mitigating entropy collapse. Evaluated on two medical question-answering datasets, the method consistently outperforms fixed-weight baselines, achieving significant improvements in both diversity and stability.
Abstract
Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and their conclusions generalize poorly to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficient of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
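The weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact functional form, so everything below is an assumption made for illustration: the positive-sample weight is taken to be the clipped ratio of current to initial entropy, samples with reward equal to the group mean are treated as negative, and the function names (`split_by_reward_mean`, `positive_weight`, `weighted_advantages`) and the bounds `w_min` and `w_max` are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple


def split_by_reward_mean(rewards: List[float]) -> Tuple[List[int], List[int]]:
    """Return (positive_indices, negative_indices) for one group of responses.

    A response counts as positive when its reward exceeds the group mean.
    Treating ties as negative is an assumption of this sketch.
    """
    mean = sum(rewards) / len(rewards)
    positive = [i for i, r in enumerate(rewards) if r > mean]
    negative = [i for i, r in enumerate(rewards) if r <= mean]
    return positive, negative


def positive_weight(current_entropy: float, initial_entropy: float,
                    w_min: float = 0.1, w_max: float = 2.0) -> float:
    """Weight for positive samples from the entropy ratio (assumed form).

    The weight shrinks as entropy falls below its initial value and grows
    as entropy rises above it; w_min and w_max are hypothetical bounds.
    """
    if initial_entropy <= 0:
        raise ValueError("initial_entropy must be positive")
    ratio = current_entropy / initial_entropy
    return max(w_min, min(w_max, ratio))


def weighted_advantages(rewards: List[float], w_pos: float) -> List[float]:
    """Mean-centred advantages, with positive ones scaled by w_pos.

    Negative-sample advantages keep a weight of 1 in this sketch.
    """
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    return [a * w_pos if a > 0 else a for a in advantages]
```

In a training loop, one would presumably recompute `positive_weight` at each update from the measured policy entropy and pass it to `weighted_advantages` before forming the policy-gradient loss; how EAPO integrates the coefficient in practice is not specified in the abstract.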