🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucinations and clinically inaccurate outputs in chest X-ray report generation, hindering clinical deployment. To address this, we propose "Look & Mark," a novel dual-modal visual grounding prompting paradigm that directly incorporates radiologists' eye-tracking trajectories ("Look") and lesion bounding-box annotations ("Mark") into MLLM inference, without any model fine-tuning. Our approach integrates spatial encoding, in-context learning (ICL), and cross-model adaptation across CXR-LLaVA, LLaVA-Med, and LLaVA-OV. Because it requires no additional training, it improves clinical alignment at zero training cost: overall performance increases by 1.2% for CXR-LLaVA and 9.2% for LLaVA-Med, and LLaVA-OV achieves a state-of-the-art clinical average accuracy of 87.3%. Expert evaluation confirms an average reduction of 0.43 clinically significant errors per report.
📝 Abstract
Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. Evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA over baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models, surpassing even those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors such as false predictions and omissions (by an average of 0.43 errors per report), enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
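To make the prompting idea concrete, the core of L&M is serializing the two grounding signals into the model's text prompt rather than its weights. The sketch below is a minimal illustration of that idea; the function names, the fixation/box tuple formats, and the exact prompt wording are our own assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Look & Mark style prompting: encode radiologist
# eye fixations ("Look") and lesion bounding boxes ("Mark") as text and
# prepend them to the report-generation instruction for an MLLM.
# All formats here are illustrative assumptions.

def encode_fixations(fixations):
    """Serialize (x, y, dwell_ms) gaze fixations into a prompt fragment."""
    pts = "; ".join(f"({x:.2f}, {y:.2f}, {d}ms)" for x, y, d in fixations)
    return f"Radiologist gaze fixations (normalized x, y, dwell time): {pts}."

def encode_boxes(boxes):
    """Serialize labeled boxes (label, x1, y1, x2, y2) into a prompt fragment."""
    items = "; ".join(
        f"{label} at [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, x1, y1, x2, y2 in boxes
    )
    return f"Annotated regions (normalized [x1, y1, x2, y2]): {items}."

def build_prompt(fixations, boxes):
    """Compose a grounding-augmented prompt; no model fine-tuning involved."""
    return "\n".join([
        encode_fixations(fixations),
        encode_boxes(boxes),
        "Using the image and the grounding cues above, "
        "write a chest X-ray report.",
    ])

prompt = build_prompt(
    fixations=[(0.42, 0.31, 820), (0.58, 0.64, 510)],
    boxes=[("opacity", 0.35, 0.25, 0.55, 0.45)],
)
print(prompt)
```

In an in-context learning setup, one or more example (grounding cues, reference report) pairs would be prepended in the same textual format before the query image's cues.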