🤖 AI Summary
Medical vision-language models (VLMs) frequently generate quantitative measurement hallucinations (e.g., erroneous endotracheal tube positioning) in chest X-ray report generation, undermining clinical reliability. To address this, we propose FactCheXcker, a modular verification framework that casts radiological measurement validation as an executable Python code-generation task. Given a VLM-generated report, FactCheXcker extracts measurable findings with rule-guided queries, uses a large language model to synthesize code that solves them, corrects critical measurements, and integrates the corrections into the final report. The approach requires no fine-tuning and enables plug-and-play hallucination mitigation. Evaluated on MIMIC-CXR across 11 state-of-the-art report-generation models, FactCheXcker reduces the mean absolute error (MAE) of quantitative measurements by 94.0% on average, substantially improving measurement accuracy while preserving linguistic quality and clinical readability.
📝 Abstract
Medical vision-language models often struggle to generate accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that mitigates measurement hallucinations in radiology reports through a query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code-generation capabilities of large language models to solve measurement queries generated from the original report, then incorporates the extracted measurable findings into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models, reducing measurement hallucinations by an average of 94.0% as measured by mean absolute error.
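The query-code-update paradigm described above can be sketched as a minimal pipeline. This is an illustrative assumption, not the authors' actual implementation: every function name is hypothetical, and the "code" step is stubbed with a fixed measurement where the real system would execute LLM-synthesized image-analysis code.

```python
# Hypothetical sketch of a query-code-update loop for measurement
# de-hallucination. All names are illustrative, not FactCheXcker's API.
import re


def extract_measurement_queries(report: str) -> list[str]:
    """Query step: detect measurable findings mentioned in the report."""
    queries = []
    if re.search(r"endotracheal tube|ET tube", report, re.IGNORECASE):
        queries.append("Measure ET tube tip distance from the carina (cm).")
    return queries


def solve_query_with_code(query: str, measured_cm: float) -> float:
    """Code step: in the real system, an LLM synthesizes Python that calls
    image-analysis tools on the X-ray; here we return a stub measurement."""
    return measured_cm


def update_report(report: str, measured_cm: float) -> str:
    """Update step: replace the hallucinated value with the computed one."""
    return re.sub(r"\d+(\.\d+)?\s*cm", f"{measured_cm:.1f} cm", report, count=1)


report = "Endotracheal tube tip is 7.5 cm above the carina."
queries = extract_measurement_queries(report)
corrected = update_report(report, solve_query_with_code(queries[0], 4.2))
print(corrected)  # → "Endotracheal tube tip is 4.2 cm above the carina."
```

The modular design means the verifier wraps around any report-generation model: only the report text and the image-derived measurement cross the module boundary, which is what makes the approach plug-and-play with no fine-tuning.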