Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports

📅 2025-10-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large vision-language models (LVLMs) frequently generate clinically critical hallucinations in radiology report generation, posing significant risks to patient safety. Existing hallucination detection methods lack sentence-level fine-grained discrimination and exhibit poor generalizability across diverse LVLMs. To address this, we propose a lightweight, context-aware sentence-level Process Reward Model (PRM) that performs model-agnostic clinical fact verification without requiring access to internal model states. Trained via weak supervision on MIMIC-CXR, the PRM jointly leverages visual context and preceding textual context to predict the factual correctness of individual sentences. It is deployed for report filtering and weighted best-of-N re-ranking. Experiments show that PRM-guided re-ranking achieves a 7.4% relative improvement in F1-CheXbert and a 0.6% gain in BERTScore, and that filtering out low-quality reports yields an additional 4.5% F1-CheXbert improvement, substantially enhancing clinical safety and real-world deployability.

📝 Abstract
Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% when discarding the worst 10% of reports. Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations.
Problem

Research questions and friction points this paper is trying to address.

Detecting clinically critical hallucinations in LVLM-generated radiology reports
Providing sentence-level verification without internal model state access
Ensuring robust generalization across different LVLM generators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sentence-level Process Reward Model for LVLM verification
Weakly-supervised fine-tuning on MIMIC-CXR dataset
Model-agnostic safety layer without internal activations
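
The filtering and weighted best-of-N steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prm_score` is a hypothetical stand-in for the trained sentence-level PRM, and the mean aggregation of sentence scores is an assumption (the paper's weighting scheme may differ).

```python
from typing import Callable, List, Sequence, Tuple


def aggregate_report_score(sentence_scores: Sequence[float]) -> float:
    """Collapse per-sentence PRM correctness scores into one report score.

    A simple mean is used here as a placeholder for the paper's
    weighted aggregation.
    """
    return sum(sentence_scores) / len(sentence_scores)


def filter_reports(
    reports: Sequence[str],
    scores: Sequence[float],
    discard_frac: float = 0.10,
) -> List[str]:
    """Discard the lowest-scoring fraction of reports (e.g. the worst 10%)."""
    ranked: List[Tuple[str, float]] = sorted(
        zip(reports, scores), key=lambda rs: rs[1], reverse=True
    )
    keep = max(1, int(round(len(ranked) * (1.0 - discard_frac))))
    return [report for report, _ in ranked[:keep]]


def weighted_best_of_n(
    candidates: Sequence[str],
    prm_score: Callable[[str], List[float]],
) -> str:
    """Best-of-N selection: score each candidate report sentence by
    sentence with the PRM, then return the candidate whose aggregated
    score is highest."""
    return max(
        candidates,
        key=lambda report: aggregate_report_score(prm_score(report)),
    )
```

Because selection depends only on generated text and PRM scores, the same re-ranking code works unchanged across different LVLM generators, which is what makes the verifier model-agnostic.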