Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically examines bidirectional hallucinations in large language models (LLMs) applied to medical imaging: image-to-text (e.g., radiology report generation from X-ray, CT, or MRI scans) and text-to-image (e.g., synthetic image generation from clinical prompts). To address core failure modes, including factual inconsistency and anatomical implausibility, the authors propose a multimodal (X-ray, CT, MRI) evaluation framework grounded in dual expert criteria: clinical semantic consistency and anatomical plausibility. To their knowledge, this is the first controlled, cross-task comparative analysis of hallucinations in both medical image understanding and generation. The study identifies and characterizes the combined impact of architectural biases and training-data limitations on the emergence of medical hallucinations, and it outlines clinically grounded mitigation strategies, emphasizing interpretability, domain-specific constraints, and human-in-the-loop validation, to improve model reliability. The findings provide empirical evidence and a methodological foundation for developing safe, trustworthy AI systems in clinical imaging.
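The dual expert criteria described above (clinical semantic consistency and anatomical plausibility) could be operationalized as a simple rating schema. The sketch below is a minimal illustration in Python; the class name, fields, 1-5 scale, and threshold are hypothetical assumptions, not the paper's actual instrument:

```python
from dataclasses import dataclass

# Hypothetical rubric: the paper rates outputs on two expert-defined axes;
# the field names and the 1-5 scale here are illustrative assumptions,
# not the authors' actual scoring instrument.
@dataclass
class HallucinationRating:
    modality: str                 # "xray", "ct", or "mri"
    task: str                     # "image_to_text" or "text_to_image"
    semantic_consistency: int     # 1-5: agreement with clinical ground truth
    anatomical_plausibility: int  # 1-5: structural realism of the output

    def is_hallucination(self, threshold: int = 3) -> bool:
        # Flag the output if either expert criterion falls below threshold.
        return (self.semantic_consistency < threshold
                or self.anatomical_plausibility < threshold)

# Example: a generated CT report that contradicts the scan's findings
# but describes anatomy plausibly.
rating = HallucinationRating("ct", "image_to_text", 2, 4)
print(rating.is_hallucination())  # True
```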

📝 Abstract
Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations: confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image-to-text, where LLMs generate reports from X-ray, CT, or MRI scans, and text-to-image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.
Problem

Research questions and friction points this paper is trying to address.

Examining how hallucinations arise in LLMs applied to medical imaging tasks
Characterizing errors in image-to-text report generation and text-to-image synthesis
Improving the safety and trustworthiness of LLM-driven medical imaging systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically analyzes hallucinations across X-ray, CT, and MRI modalities
Evaluates outputs against expert-informed criteria for clinical semantic consistency and anatomical plausibility
Conducts a controlled cross-task comparison of image understanding and generation (see the sketch below)
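As a rough illustration of how such a cross-task comparison could be tabulated, the sketch below aggregates flagged outputs per (task, modality) cell. The function name and the reuse of the hypothetical HallucinationRating sketch above are assumptions, not the paper's actual analysis code:

```python
from collections import defaultdict

def hallucination_rates(ratings):
    """Aggregate hallucination rates per (task, modality) cell.

    `ratings` is any iterable of objects shaped like the hypothetical
    HallucinationRating above. The (task, modality) grid mirrors the
    paper's controlled cross-task, cross-modality comparison, but this
    aggregation itself is an illustrative assumption.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for r in ratings:
        cell = (r.task, r.modality)
        totals[cell] += 1
        flagged[cell] += r.is_hallucination()  # bool counts as 0/1
    return {cell: flagged[cell] / totals[cell] for cell in totals}
```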