🤖 AI Summary
This work addresses safety concerns in generative vision-language models for pathological image captioning, specifically hallucination, over-diagnosis, and factual inconsistency, by proposing a retrieval-guided generation (RGG) approach. Instead of generating descriptions from scratch, RGG retrieves visually similar historical cases and summarizes their expert-written text to produce the final caption. This strategy preserves morphological terminology while reducing unsupported diagnostic statements, which makes model outputs easier to audit. Evaluated on the ARCH dataset, the method achieves higher semantic alignment with reference captions (cosine similarity of approximately 0.60 versus approximately 0.47 for MedGemma, with non-overlapping confidence intervals). A pathologist-led qualitative review further reports better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while also revealing failure modes such as concept mixing and inherited over-specific labeling.
📝 Abstract
Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency, all of which are serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of $\approx$0.60 versus $\approx$0.47 for MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.
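The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings, the captions, and the helper names `cosine_similarity` and `retrieve_top_k` are hypothetical, and the abstract does not specify the image encoder or the summarizer, so both are left out. The sketch only shows how the expert captions of the nearest cases might be selected by cosine similarity before being summarized.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, case_bank, k=3):
    """Return the k (score, caption) pairs whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), caption)
        for embedding, caption in case_bank
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy case bank: made-up 3-d embeddings paired with made-up expert captions.
# Real embeddings would come from an image encoder, which is assumed here.
case_bank = [
    ([0.9, 0.1, 0.0], "Glandular structures with nuclear atypia."),
    ([0.0, 1.0, 0.2], "Dense lymphocytic infiltrate in the stroma."),
    ([0.8, 0.2, 0.1], "Irregular glands with loss of polarity."),
]

query = [1.0, 0.1, 0.0]  # hypothetical embedding of the query image
top_cases = retrieve_top_k(query, case_bank, k=2)
retrieved_captions = [caption for _, caption in top_cases]

# In a retrieval-guided pipeline, these captions (not the image alone) would be
# handed to a summarizer to form the final caption; that step is omitted here.
for score, caption in top_cases:
    print(f"{score:.3f}  {caption}")
```

Because the final caption is built only from the retrieved expert text, each statement in it can in principle be traced back to a specific source case, which is the auditing advantage the abstract points to. The same cosine measure, applied to text embeddings of generated and reference captions, is the kind of score reported in the evaluation.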