Lessons from Training Grounded LLMs with Verifiable Rewards

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) still suffer from factual inaccuracies, citation mismatches, and inappropriate refusals when generating verifiable, trustworthy responses. To address these challenges, we propose a reinforcement learning framework that requires no human-annotated reasoning traces. Our method introduces the first multi-stage GRPO (Group Relative Policy Optimization) training paradigm, decoupling the optimization of answer and citation generation from refusal decisions. We design a verifiability-aware reward function that jointly optimizes response correctness, citation sufficiency, and refusal appropriateness without requiring gold reasoning traces. The approach integrates retrieval-augmented generation (RAG), GPT-4-distilled instruction fine-tuning, and multi-task evaluation on long-form QA benchmarks (ASQA, QAMPARI, ELI5, ExpertQA). Experiments demonstrate substantial improvements in identifying unanswerable questions and in citation accuracy, outperforming instruction-fine-tuned baselines across multiple benchmarks and significantly enhancing the factual consistency and verifiability of LLM responses.

📝 Abstract
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA, we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
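The abstract describes an outcome-based reward built from three verifiable signals: answer correctness, citation sufficiency, and refusal quality. A minimal sketch of such a composite reward follows; the weights and the specific function name are illustrative assumptions, not the paper's actual design.

```python
def grounded_reward(answer_correct: bool, citations_sufficient: bool,
                    refused: bool, answerable: bool) -> float:
    """Illustrative composite reward over three verifiable outcomes.

    Weights (0.5 / 0.5) and the refusal handling are assumptions for
    demonstration only; the paper's exact reward shaping may differ.
    """
    if not answerable:
        # Reward appropriate refusal on unanswerable questions.
        return 1.0 if refused else 0.0
    if refused:
        # Penalize refusing when supporting evidence is available.
        return 0.0
    reward = 0.0
    if answer_correct:
        reward += 0.5
    if citations_sufficient:
        reward += 0.5
    return reward
```

Because each component is checkable against retrieved evidence, the reward needs no human-annotated reasoning traces, matching the outcome-driven setup the abstract describes.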
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM grounding with RL and verifiable rewards
Improving citation accuracy and refusal quality in responses
Optimizing answer correctness via reasoning-augmented training stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRPO method for verifiable outcome-based rewards
Two-stage training optimizing answer and refusal
GPT-4 distillation combined with GRPO
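GRPO's core idea is to compute advantages relative to a group of sampled completions rather than via a learned value network: each completion's reward is normalized against the group mean and standard deviation. A minimal sketch, assuming a small epsilon for numerical stability (an implementation detail not specified here):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against its sampling group.

    This is the group-relative baseline at the heart of GRPO: no value
    network is needed, only per-group reward statistics.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With verifiable rewards like the ones above, completions that answer and cite correctly receive positive advantages while incorrect or poorly cited ones in the same group receive negative advantages, which is what drives the policy update.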
Shang Hong Sim
Singapore University of Technology and Design
Tej Deep Pala
Singapore University of Technology and Design
Vernon Toh
Singapore University of Technology and Design
Hai Leong Chieu
Distinguished Member of Technical Staff, DSO National Laboratories, Singapore
Artificial Intelligence · Machine Learning · Natural Language Processing
Amir Zadeh
Staff ML Researcher, Lambda
Multimodal Machine Learning · NLP · Computer Vision · Speech and Audio Processing
Chuan Li
Lambda Labs
Navonil Majumder
Singapore University of Technology and Design
Natural Language Processing · Machine Learning · Neural Networks · Deep Learning
Soujanya Poria
Singapore University of Technology and Design