🤖 AI Summary
This work addresses the challenge of sparse rewards in reinforcement learning for long-context reasoning, which often leads models to rely on spurious correlations rather than retrieving critical evidence. To overcome this limitation, the authors propose an Evidence-Augmented Reasoning (EAR) paradigm. Through tree-structured evidence sampling, they identify evidence extraction as the key performance bottleneck and introduce the Evidence-Augmented Policy Optimization (EAPO) algorithm. EAPO provides dense process supervision via group-relative evidence rewards and incorporates an adaptive co-evolution mechanism between the reward model and the policy to iteratively improve evidence quality. Evaluated across eight benchmark datasets, the proposed method significantly outperforms current state-of-the-art approaches in long-context reasoning.
📝 Abstract
While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm in which a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
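To make the "Group-Relative Evidence Reward" concrete, here is a minimal sketch of group-relative normalization over per-rollout evidence scores, in the spirit of GRPO-style advantages. All names (`group_relative_rewards`, the raw score values) are illustrative assumptions; the paper's actual reward model, scoring rubric, and how the evidence reward is combined with the outcome reward are not specified here.

```python
# Illustrative sketch (not the paper's exact algorithm): normalize raw
# evidence scores within a group of rollouts sampled for the same query,
# so each rollout's reward reflects how its evidence compares to peers.

def group_relative_rewards(evidence_scores, eps=1e-8):
    """Return z-score-normalized rewards within one rollout group."""
    n = len(evidence_scores)
    mean = sum(evidence_scores) / n
    var = sum((s - mean) ** 2 for s in evidence_scores) / n
    std = var ** 0.5
    # eps guards against division by zero when all scores are identical
    return [(s - mean) / (std + eps) for s in evidence_scores]

# Hypothetical example: four rollouts for one long-context query, each
# scored (e.g., by a reward model) on how well its cited evidence
# grounds the final answer.
scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_rewards(scores)
```

Because the normalization is relative within the group, the best-grounded rollout receives a positive dense signal and the worst a negative one even when all rollouts happen to reach the same final answer, which is what distinguishes this process-level supervision from a sparse outcome reward.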