LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

📅 2025-10-08
🤖 AI Summary
Existing reward models (RMs) are primarily designed for short-context settings and neglect consistency modeling between responses and extended historical trajectories, limiting their applicability to real-world LLM agents. This work identifies systematic vulnerabilities in their long-context preference judgment capability. Method: We introduce Long-RewardBench—the first benchmark dedicated to evaluating RMs on long-context preference ranking—and propose a general multi-stage training framework that jointly leverages pairwise comparison and best-of-N objectives to progressively enhance long-range dependency awareness and consistency discrimination. Contribution/Results: Our 8B-parameter LongRM substantially outperforms a 70B baseline on long-context tasks, matching the performance of Gemini 2.5 Pro, while retaining state-of-the-art accuracy on short-context benchmarks. This demonstrates that compact models, when trained with structured, stage-wise objectives, can achieve efficient and effective long-context reward modeling.

📝 Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by an analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
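The two benchmark tasks can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's code: `score` stands in for a reward model's scalar judgment (here a toy word-overlap heuristic), while `pairwise_accuracy` and `best_of_n` mirror the Pairwise Comparison and Best-of-N evaluation protocols described above.

```python
def score(context: str, response: str) -> float:
    # Hypothetical stand-in for a reward model's scalar score.
    # Toy heuristic: fraction of response words grounded in the context.
    ctx_words = set(context.split())
    resp_words = response.split()
    if not resp_words:
        return 0.0
    return sum(w in ctx_words for w in resp_words) / len(resp_words)

def pairwise_accuracy(examples):
    """Pairwise Comparison: fraction of (context, chosen, rejected)
    triples where the RM scores the chosen response higher."""
    correct = sum(
        score(ctx, chosen) > score(ctx, rejected)
        for ctx, chosen, rejected in examples
    )
    return correct / len(examples)

def best_of_n(context, candidates):
    """Best-of-N: return the candidate the RM ranks highest."""
    return max(candidates, key=lambda r: score(context, r))
```

With a real long-context RM, `score` would consume the full historical trajectory; the evaluation logic around it stays the same.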
Problem

Research questions and friction points this paper is trying to address.

Evaluating reward models' context-response consistency judgments in long-context scenarios
Addressing the fragility of current RMs on long history trajectories
Building robust long-context reward models without sacrificing short-context capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed Long-RewardBench, the first benchmark for long-context reward model evaluation
Introduced a multi-stage training strategy that scales arbitrary models into robust LongRMs
Showed that an 8B LongRM preserves short-context accuracy while outperforming 70B-scale baselines on long-context tasks
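The multi-stage strategy can be pictured as a staged schedule in which each stage pairs an objective with a data mixture. This is a hedged sketch only: the stage names, objectives, and context lengths below are illustrative assumptions, not the paper's actual curriculum.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Stage:
    name: str
    objective: str        # e.g. "pairwise" or "best_of_n" (illustrative)
    max_context_len: int  # contexts grow progressively longer

def run_schedule(stages: List[Stage],
                 train_step: Callable[[Stage], float]) -> Dict[str, float]:
    """Run stages in order, recording each stage's final metric."""
    results = {}
    for stage in stages:
        results[stage.name] = train_step(stage)
    return results

# Hypothetical curriculum: short-context skills first, then
# progressively longer contexts and the Best-of-N objective.
stages = [
    Stage("short-context pairwise", "pairwise", 4_096),
    Stage("long-context pairwise", "pairwise", 65_536),
    Stage("long-context best-of-N", "best_of_n", 131_072),
]
```

The point of the staged structure is that long-range dependency awareness is layered on top of an already-competent short-context RM, rather than trained from scratch.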