🤖 AI Summary
This study systematically uncovers bidirectional hallucinations in large language models (LLMs) applied to medical imaging: image-to-text (e.g., radiology report generation from X-ray/CT/MRI) and text-to-image (e.g., clinical-prompt-driven synthetic imaging). To address core deficiencies, including factual inconsistency and anatomical implausibility, we propose a multimodal (X-ray, CT, MRI) evaluation framework grounded in dual expert criteria: clinical semantic consistency and anatomical plausibility. To our knowledge, this is the first work to conduct a controlled, cross-task comparative analysis of hallucinations in both medical image understanding and generation. We identify and characterize the synergistic impact of architectural biases and training data limitations on the emergence of medical hallucinations. Finally, we outline clinically grounded mitigation strategies, emphasizing interpretability, domain-specific constraints, and human-in-the-loop validation, to enhance model reliability. Our findings provide empirical evidence and methodological foundations for developing safe, trustworthy AI systems in clinical imaging.
📝 Abstract
Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations: confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image-to-text, where LLMs generate reports from X-ray, CT, or MRI scans, and text-to-image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs against expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.