🤖 AI Summary
This work addresses the challenge of sparse rewards in reinforcement learning for long-context reasoning, which often leads models to rely on spurious correlations rather than retrieving critical evidence. To overcome this limitation, the authors propose an Evidence-Augmented Reasoning (EAR) paradigm. Through tree-structured evidence sampling, they identify evidence extraction as the key performance bottleneck and introduce the Evidence-Augmented Policy Optimization (EAPO) algorithm. EAPO provides dense process supervision via group-relative evidence rewards and incorporates an adaptive co-evolution mechanism between the reward model and the policy to iteratively improve evidence quality. Evaluated across eight benchmark datasets, the proposed method significantly outperforms current state-of-the-art approaches in long-context reasoning.
📝 Abstract
While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm in which a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
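To make the "Group-Relative Evidence Reward" concrete, here is a minimal sketch of group-relative normalization over per-rollout evidence scores, in the spirit of GRPO-style advantages. All names (`group_relative_rewards`, the raw score values) are illustrative assumptions; the paper's actual reward model, scoring rubric, and how the evidence reward is combined with the outcome reward are not specified here.

```python
# Illustrative sketch (not the paper's exact algorithm): normalize raw
# evidence scores within a group of rollouts sampled for the same query,
# so each rollout's reward reflects how its evidence compares to peers.

def group_relative_rewards(evidence_scores, eps=1e-8):
    """Return z-score-normalized rewards within one rollout group."""
    n = len(evidence_scores)
    mean = sum(evidence_scores) / n
    var = sum((s - mean) ** 2 for s in evidence_scores) / n
    std = var ** 0.5
    # eps guards against division by zero when all scores are identical
    return [(s - mean) / (std + eps) for s in evidence_scores]

# Hypothetical example: four rollouts for one long-context query, each
# scored (e.g., by a reward model) on how well its cited evidence
# grounds the final answer.
scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_rewards(scores)
```

Because the normalization is relative within the group, the best-grounded rollout receives a positive dense signal and the worst a negative one even when all rollouts happen to reach the same final answer, which is what distinguishes this process-level supervision from a sparse outcome reward.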