MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

Date: 2026-04-15
Citations: 0
Influential citations: 0

AI Summary
Current vision-language models struggle to align the key textual descriptions in medical reports with the small but clinically significant image regions they refer to. This work proposes a multi-task, multi-instance approach to fine-grained image-text alignment that disentangles anatomical structures from pathological findings. It combines a text embedding trained to capture anatomical and diagnostic concepts, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment mechanism that links sentence-level diagnostic descriptions to the relevant image regions. Evaluated on several downstream tasks, the approach improves alignment performance over state-of-the-art baselines and can map multiple findings described in free-text radiology reports to their corresponding image regions.

Abstract
In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.
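
The abstract does not spell out how the multi-instance alignment is computed, so the following is only a generic illustration of the idea, not the authors' implementation. It assumes sentence and patch embeddings already live in a shared space, scores every sentence against every patch by cosine similarity, and treats the patches as a bag in which a sentence is represented by its best-matching patch (max pooling). The function names, embedding dimension, and toy data are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sentence_patch_similarity(sentences, patches):
    """Cosine similarity of every sentence with every patch, shape (S, P)."""
    return l2_normalize(sentences) @ l2_normalize(patches).T


def mil_sentence_scores(similarity):
    """Multi-instance max pooling (an assumed choice): each sentence keeps its best patch."""
    best_patch = similarity.argmax(axis=1)
    return similarity.max(axis=1), best_patch


# Toy data (hypothetical): 5 patch embeddings and 3 sentence embeddings of dimension 4.
rng = np.random.default_rng(0)
patches = rng.normal(size=(5, 4))
# Sentences 0 and 1 are copies of patches 1 and 3, so those patches are their best matches.
sentences = np.stack([patches[1], patches[3], -patches[0]])

sim = sentence_patch_similarity(sentences, patches)
scores, best = mil_sentence_scores(sim)
print(sim.shape, best[:2])
```

Max pooling is the simplest bag-level aggregation; a softer alternative such as attention-weighted or log-sum-exp pooling would let several patches contribute to one sentence, which may suit findings that span multiple regions.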
Problem

Research questions and friction points this paper is trying to address.

vision-language alignment
medical imaging
diagnostic reports
multi-instance learning
fine-grained localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language alignment
multi-instance learning
medical image-text alignment
anatomical-diagnostic disentanglement
patch-wise encoding
Felicia Bader
Computational Imaging Research Lab, Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Austria; Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria
Philipp Seeböck
Computational Imaging Research Lab, Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Austria; Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria; Medical Anomaly Detection (MANO) Group, Computational Imaging Research (CIR), Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Austria
Anastasia Bartashova
Department of Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Austria
Ulrike Attenberger
Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria; Department of Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Austria
Georg Langs
Medical University of Vienna, CIR Lab
Machine Learning in NeuroImaging, Functional Connectivity