🤖 AI Summary
This study addresses the challenges of training retrieval-augmented generation (RAG) systems for medical question answering when using natural language inference (NLI) checkers as process rewards, where signal collapse and reward hacking often impede effective learning. By analyzing the output distributions of NLI checkers rather than relying solely on external accuracy metrics, the authors reveal that signal collapse stems from large language models’ log-probability distributions being overly biased toward neutral labels. They further demonstrate that moderate-strength reward signals can prevent cascading reward hacking and that signal efficacy is inherently dependent on the policy itself. Within the GRPO framework, experiments with Qwen2.5-7B, Qwen3-4B, and Llama-3.1-8B agents coupled with various NLI checkers show that models trained using a locally calibrated, moderate-signal MedNLI classifier significantly outperform zero-shot baselines across multiple medical QA benchmarks—achieving a 12% gain in BERTScore—without requiring GPT-class models.
📝 Abstract
Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. \textbf{We find that the checker's \emph{output distribution} during training, not its held-out accuracy, decides whether it provides trainable gradient.} We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. \textbf{(i)} Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97\% of claims neutral -- collapsing the RL gradient to zero -- while a calibrated MedNLI classifier scores the same pairs non-degenerately. \textbf{(ii)} Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade -- ultra-short answers, search avoidance, language collapse -- so a moderate-signal local classifier trains a higher-quality model (\textbf{+12\% BERTScore over zero-shot, no GPT dependency}). \textbf{(iii)} Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.