Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

📅 2026-05-28

🤖 AI Summary

This work addresses the limitations of reinforcement learning in knowledge-intensive question answering, where existing approaches rely either on coarse answer-level rewards or on expensive neural verifiers that are often unreliable for rare-entity facts. The authors propose CorVer, a lightweight process-level reward derived from Wikipedia co-occurrence statistics. It replaces neural verifiers and requires only a 0.5B extractor and a single corpus lookup per sentence. CorVer assigns credit at the sentence level and maps it to token-level advantages through a simple alignment. Across 30 model–benchmark cells (six instruction-tuned models from 3B to 14B, five QA benchmarks), CorVer improves over the raw baseline in every cell, with an average gain of 4.1 percentage points on TriviaQA. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8–8.4× faster.

📝 Abstract

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
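
To make the idea of a corpus-grounded sentence reward mapped to token-level advantages more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the scoring function or the alignment rule, so the pair-support score, the `min_count` threshold, the neutral value of 0.5, the `weight` parameter, and the toy `COOC` table are all assumptions made for illustration. Entity extraction (done by a 0.5B extractor in the paper) is assumed to have already happened.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy co-occurrence table: unordered entity pair -> count.
# A real system would query an index built from Wikipedia; these counts are made up.
COOC: Dict[FrozenSet[str], int] = {
    frozenset({"Marie Curie", "Warsaw"}): 120,
    frozenset({"Marie Curie", "Nobel Prize"}): 900,
}


def sentence_score(
    entities: List[str],
    cooc: Dict[FrozenSet[str], int],
    min_count: int = 5,  # assumed support threshold, not from the paper
) -> float:
    """Fraction of entity pairs in a sentence that co-occur at least min_count times.

    Returns 0.5 (treated as neutral) when the sentence has fewer than two
    distinct entities, since there is nothing to look up.
    """
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:
        return 0.5
    supported = sum(
        1 for a, b in pairs if cooc.get(frozenset({a, b}), 0) >= min_count
    )
    return supported / len(pairs)


def token_advantages(
    response_adv: float,
    sentence_spans: List[Tuple[int, int]],  # [start, end) token indices per sentence
    scores: List[float],
    n_tokens: int,
    weight: float = 0.5,  # assumed mixing weight, not from the paper
) -> List[float]:
    """Broadcast the response-level advantage to all tokens, then shift the
    tokens of each sentence by weight * (score - 0.5)."""
    adv = [response_adv] * n_tokens
    for (start, end), score in zip(sentence_spans, scores):
        for t in range(start, end):
            adv[t] += weight * (score - 0.5)
    return adv


# Two sentences of three tokens each: one supported by the corpus, one not.
scores = [
    sentence_score(["Marie Curie", "Warsaw"], COOC),
    sentence_score(["Marie Curie", "Berlin"], COOC),
]
advantages = token_advantages(1.0, [(0, 3), (3, 6)], scores, n_tokens=6)
```

In this sketch, tokens in the corpus-supported sentence end up with a higher advantage than tokens in the unsupported one, even though both sentences share the same response-level reward. That is the kind of within-trace distinction a response-level reward alone cannot provide.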

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

factual question answering

reward design

process supervision

fact verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

process supervision

corpus-grounded reward

factual question answering