🤖 AI Summary
To address the challenges of sparse rewards, neglect of textual sequence structure, and insufficient continuity modeling in RLHF, this paper proposes a segment-level reward modeling and optimization framework. The method partitions generated text into semantically complete segments via dynamic segmentation, generalizes classical scalar bandit reward normalizers into location-aware normalizer functions, and interpolates segment rewards to densify the training signal, while reward learning stays compatible with standard sequence-preference datasets. The approach performs competitively on three major RLHF benchmarks, AlpacaEval 2.0, Arena-Hard, and MT-Bench, with ablation studies confirming the contribution of each component. The core contribution is a fine-grained, structure-aware segment-level reward paradigm that sits between sparse sequence-level rewards and overly fine token-level rewards.
📝 Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and is compatible with standard sequence-preference datasets. For effective RL-based LM training against segment rewards, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment rewards for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further validate our method.
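The densification pipeline the abstract describes (normalize each segment reward with a location-aware function, then spread it over the segment's tokens) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the segment spans, the per-position statistics `mu`/`sigma`, and the uniform within-segment interpolation are all simplifying assumptions.

```python
from typing import List, Tuple

def normalize(reward: float, pos: int,
              mu: List[float], sigma: List[float]) -> float:
    """Location-aware generalization of the scalar bandit normalizer:
    instead of a single (mu, sigma) pair, statistics vary with the
    segment's position in the sequence (toy fixed values here)."""
    return (reward - mu[pos]) / sigma[pos]

def densify(seq_len: int,
            segments: List[Tuple[int, int, float]],
            mu: List[float], sigma: List[float]) -> List[float]:
    """Turn sparse segment rewards into a dense per-token signal.

    Each segment is a (start, end, reward) token span. Its normalized
    reward is interpolated (here: uniformly) across its tokens, so the
    RL trainer sees a reward at every token rather than only at
    segment boundaries."""
    dense = [0.0] * seq_len
    for start, end, r in segments:
        r_norm = normalize(r, start, mu, sigma)
        span = end - start
        for t in range(start, end):
            dense[t] = r_norm / span  # uniform within-segment interpolation
    return dense
```

For example, two segments covering tokens 0-2 and 3-5 with rewards 1.0 and 2.0 (identity normalization) yield a per-token signal of 1/3 on the first segment and 2/3 on the second, preserving each segment's total reward while densifying it.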