Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference

📅 2025-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models face a fundamental trade-off between granularity and semantic coherence: coarse-grained models lack fine-grained credit assignment, while fine-grained models often compromise sentence-level semantic integrity. To address this, we propose Sentence-level Reward Modeling (SRM), which segments LLM responses into sentences, applies positional differencing to identify each sentence’s relative contribution, and employs a trainable adaptive attention mechanism to aggregate sentence-level signals into a response-level scalar score—enabling Bradley–Terry likelihood-based training. This work constitutes the first systematic investigation of intermediate-granularity reward modeling, balancing semantic coherence with fine-grained alignment capability. Experiments demonstrate that SRM achieves a +2.7% improvement on RewardBench and consistently outperforms all baselines on AlpacaEval, validating its effectiveness and generalizability for human preference alignment.
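For concreteness, the scoring path described above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own assumptions (module names, shapes, and the single-query attention form are not taken from the paper's code): a per-position reward head is differenced at sentence boundaries, and a trainable query attends over sentence representations to produce the response-level scalar.

```python
import torch
import torch.nn as nn

class SentenceRewardModel(nn.Module):
    """Minimal SRM sketch; names and shapes are illustrative assumptions."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.reward_head = nn.Linear(hidden_dim, 1)               # per-position scalar reward
        self.attn_query = nn.Parameter(torch.randn(hidden_dim))   # trainable aggregation query

    def forward(self, hidden_states, sent_starts, sent_ends):
        # hidden_states: (seq_len, hidden_dim) from the backbone, one response
        # sent_starts / sent_ends: token indices of each sentence's boundaries
        r = self.reward_head(hidden_states).squeeze(-1)           # (seq_len,)
        # Positional differencing: a sentence's reward is the change in the
        # head's output between its end and start positions.
        sent_rewards = r[sent_ends] - r[sent_starts]              # (num_sents,)
        # Adaptive attention: weight sentences by how strongly their
        # end-of-sentence representations match a learned query.
        sent_repr = hidden_states[sent_ends]                      # (num_sents, hidden_dim)
        weights = torch.softmax(sent_repr @ self.attn_query, dim=0)
        return (weights * sent_rewards).sum()                     # response-level scalar
```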

📝 Abstract
Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
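Because the attention step collapses sentence-level signals into a single response-level scalar, training reduces to the standard Bradley-Terry negative log-likelihood over preference pairs. A minimal sketch follows; the function name and toy scores are hypothetical.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen response outranks the rejected one.
    # Inputs are response-level scalars, e.g. outputs of the SRM sketch above.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage with toy scores for a batch of two preference pairs:
chosen = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.4, 0.9])
loss = bradley_terry_loss(chosen, rejected)   # scalar; backpropagates through the reward model
```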
Problem

Research questions and friction points this paper is trying to address.

Improve alignment of LLMs with human preferences
Address sparse reward challenges in reinforcement learning
Enhance reward modeling with sentence-level granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intermediate-grained sentence-level reward modeling
Novel attention mechanism for score aggregation
Differential operations for sentence reward calculation (see the segmentation sketch below)
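As a sketch of the segmentation step these contributions rely on (our simplification, assuming an offset-mapping tokenizer and splitting on sentence-final punctuation; the paper's exact rules may differ), character-level sentence breaks can be mapped to the token indices the differencing step needs:

```python
import re

def sentence_boundaries(token_offsets, text):
    # token_offsets: (char_start, char_end) per token, as produced by
    # offset-mapping tokenizers. Splitting on .!? is a simplification,
    # not the paper's documented segmentation rule.
    starts, ends, prev_tok = [], [], 0
    for m in re.finditer(r"[.!?]", text):
        # Last token whose character span ends by this punctuation mark.
        candidates = [i for i, (_, e) in enumerate(token_offsets) if e <= m.end()]
        if not candidates or candidates[-1] < prev_tok:
            continue
        starts.append(prev_tok)
        ends.append(candidates[-1])
        prev_tok = candidates[-1] + 1
    return starts, ends
```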
Wenjie Qiu
South China University of Technology
Large scale global optimization, Black box optimization, Evolutionary computation
Yi-Chen Li
Nanjing University
Reinforcement Learning, Imitation Learning, RLHF
Xuqin Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Tianyi Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Yihang Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Zongzhang Zhang
Nanjing University
Artificial Intelligence, Reinforcement Learning, Probabilistic Planning, Multi-Agent Systems
Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; Polixir Technologies