Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference

📅 2025-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models face a fundamental trade-off between granularity and semantic coherence: coarse-grained models lack fine-grained credit assignment, while fine-grained models often compromise sentence-level semantic integrity. To address this, we propose Sentence-level Reward Modeling (SRM), which segments LLM responses into sentences, applies positional differencing to identify each sentence’s relative contribution, and employs a trainable adaptive attention mechanism to aggregate sentence-level signals into a response-level scalar score—enabling Bradley–Terry likelihood-based training. This work constitutes the first systematic investigation of intermediate-granularity reward modeling, balancing semantic coherence with fine-grained alignment capability. Experiments demonstrate that SRM achieves a +2.7% improvement on RewardBench and consistently outperforms all baselines on AlpacaEval, validating its effectiveness and generalizability for human preference alignment.
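For concreteness, the scoring path described above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own assumptions (module names, shapes, and the single-query attention form are not taken from the paper's code): a per-position reward head is differenced at sentence boundaries, and a trainable query attends over sentence representations to produce the response-level scalar.

```python
import torch
import torch.nn as nn

class SentenceRewardModel(nn.Module):
    """Minimal SRM sketch; names and shapes are illustrative assumptions."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.reward_head = nn.Linear(hidden_dim, 1)               # per-position scalar reward
        self.attn_query = nn.Parameter(torch.randn(hidden_dim))   # trainable aggregation query

    def forward(self, hidden_states, sent_starts, sent_ends):
        # hidden_states: (seq_len, hidden_dim) from the backbone, one response
        # sent_starts / sent_ends: token indices of each sentence's boundaries
        r = self.reward_head(hidden_states).squeeze(-1)           # (seq_len,)
        # Positional differencing: a sentence's reward is the change in the
        # head's output between its end and start positions.
        sent_rewards = r[sent_ends] - r[sent_starts]              # (num_sents,)
        # Adaptive attention: weight sentences by how strongly their
        # end-of-sentence representations match a learned query.
        sent_repr = hidden_states[sent_ends]                      # (num_sents, hidden_dim)
        weights = torch.softmax(sent_repr @ self.attn_query, dim=0)
        return (weights * sent_rewards).sum()                     # response-level scalar
```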

📝 Abstract
Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
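Because the attention step collapses sentence-level signals into a single response-level scalar, training reduces to the standard Bradley-Terry negative log-likelihood over preference pairs. A minimal sketch follows; the function name and toy scores are hypothetical.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen response outranks the rejected one.
    # Inputs are response-level scalars, e.g. outputs of the SRM sketch above.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage with toy scores for a batch of two preference pairs:
chosen = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.4, 0.9])
loss = bradley_terry_loss(chosen, rejected)   # scalar; backpropagates through the reward model
```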
Problem

Research questions and friction points this paper is trying to address.

Improve alignment of LLMs with human preferences
Address sparse reward challenges in reinforcement learning
Enhance reward modeling with sentence-level granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intermediate-grained sentence-level reward modeling
Novel attention mechanism for score aggregation
Differential operations for sentence reward calculation (see the segmentation sketch below)
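As a sketch of the segmentation step these contributions rely on (our simplification, assuming an offset-mapping tokenizer and splitting on sentence-final punctuation; the paper's exact rules may differ), character-level sentence breaks can be mapped to the token indices the differencing step needs:

```python
import re

def sentence_boundaries(token_offsets, text):
    # token_offsets: (char_start, char_end) per token, as produced by
    # offset-mapping tokenizers. Splitting on .!? is a simplification,
    # not the paper's documented segmentation rule.
    starts, ends, prev_tok = [], [], 0
    for m in re.finditer(r"[.!?]", text):
        # Last token whose character span ends by this punctuation mark.
        candidates = [i for i, (_, e) in enumerate(token_offsets) if e <= m.end()]
        if not candidates or candidates[-1] < prev_tok:
            continue
        starts.append(prev_tok)
        ends.append(candidates[-1])
        prev_tok = candidates[-1] + 1
    return starts, ends
```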
Wenjie Qiu
South China University of Technology
Large scale global optimization, Black box optimization, Evolutionary computation
Yi-Chen Li
Nanjing University
Reinforcement Learning, Imitation Learning, RLHF
Xuqin Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Tianyi Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Yihang Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Zongzhang Zhang
Nanjing University
Artificial Intelligence, Reinforcement Learning, Probabilistic Planning, Multi-Agent Systems
Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; Polixir Technologies