🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucinations and clinically inaccurate outputs in chest X-ray report generation, hindering clinical deployment. To address this, we propose "Look & Mark," a novel dual-modal visual grounding prompting paradigm that directly incorporates radiologists' eye-tracking trajectories ("Look") and lesion bounding-box annotations ("Mark") into MLLM inference, without any model fine-tuning. Our approach integrates spatial encoding, in-context learning (ICL), and cross-model adaptation across CXR-LLaVA, LLaVA-Med, and LLaVA-OV. Because it requires no additional training, it improves clinical alignment at zero training cost: overall performance increases by 1.2% for CXR-LLaVA and 9.2% for LLaVA-Med, and LLaVA-OV achieves a state-of-the-art clinical average accuracy of 87.3%. Expert evaluation confirms an average reduction of 0.43 clinically significant errors per report.
📝 Abstract
Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. Evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA over baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models, surpassing even those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors such as false predictions and omissions (by an average of 0.43 errors per report), enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
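To make the prompting idea concrete, the core of L&M is serializing the two grounding signals into the model's text prompt rather than its weights. The sketch below is a minimal illustration of that idea; the function names, the fixation/box tuple formats, and the exact prompt wording are our own assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Look & Mark style prompting: encode radiologist
# eye fixations ("Look") and lesion bounding boxes ("Mark") as text and
# prepend them to the report-generation instruction for an MLLM.
# All formats here are illustrative assumptions.

def encode_fixations(fixations):
    """Serialize (x, y, dwell_ms) gaze fixations into a prompt fragment."""
    pts = "; ".join(f"({x:.2f}, {y:.2f}, {d}ms)" for x, y, d in fixations)
    return f"Radiologist gaze fixations (normalized x, y, dwell time): {pts}."

def encode_boxes(boxes):
    """Serialize labeled boxes (label, x1, y1, x2, y2) into a prompt fragment."""
    items = "; ".join(
        f"{label} at [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, x1, y1, x2, y2 in boxes
    )
    return f"Annotated regions (normalized [x1, y1, x2, y2]): {items}."

def build_prompt(fixations, boxes):
    """Compose a grounding-augmented prompt; no model fine-tuning involved."""
    return "\n".join([
        encode_fixations(fixations),
        encode_boxes(boxes),
        "Using the image and the grounding cues above, "
        "write a chest X-ray report.",
    ])

prompt = build_prompt(
    fixations=[(0.42, 0.31, 820), (0.58, 0.64, 510)],
    boxes=[("opacity", 0.35, 0.25, 0.55, 0.45)],
)
print(prompt)
```

In an in-context learning setup, one or more example (grounding cues, reference report) pairs would be prepended in the same textual format before the query image's cues.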