🤖 AI Summary
Existing mammographic vision-language models neglect multi-view geometric relationships, leading to cross-view semantic inconsistency and loss of critical anatomical context. To address this, we propose a geometry-guided vision-language pretraining framework that, for the first time, introduces an anatomy-informed local alignment mechanism to explicitly model pixel-level correspondences between paired mammographic views. Furthermore, our method integrates triple contrastive learning—global-local, vision-vision, and vision-language—to jointly optimize multimodal representations under explicit multi-view geometric constraints. Evaluated on multiple public benchmarks, our approach achieves significant improvements in breast lesion classification accuracy and radiology report generation quality, consistently outperforming state-of-the-art baselines. It demonstrates strong robustness and generalization across varying annotation scales and modality configurations.
📝 Abstract
Mammography screening is an essential tool for early detection of breast cancer, and deep learning methods have the potential to improve the speed and accuracy of mammography interpretation. However, the development of a foundation vision-language model (VLM) for mammography is hindered by limited data and by domain differences between natural and medical images. Existing mammography VLMs, adapted from natural-image models, often ignore domain-specific characteristics such as the multi-view relationships in mammography. Unlike radiologists, who analyze both views together to exploit ipsilateral correspondence, current methods either treat the views as independent images or fail to properly model multi-view correspondence, losing critical geometric context and producing suboptimal predictions. We propose GLAM: Global and Local Alignment for Multi-view mammography, a framework for VLM pretraining using geometry guidance. By leveraging prior knowledge of the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets and settings.
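As a rough illustration of how such a joint objective could be composed, the following PyTorch sketch combines a global vision-vision contrastive loss between paired ipsilateral views, a vision-language contrastive loss between image and report embeddings, and a local alignment term driven by a precomputed geometric correspondence map. The function names (`info_nce`, `glam_style_loss`), the anchoring of the vision-language term on the CC view, and the way correspondences are supplied are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def glam_style_loss(cc_global, mlo_global, text_global,
                    cc_local, mlo_local, correspondence,
                    w_vv=1.0, w_vl=1.0, w_local=1.0):
    """Hypothetical combined objective for geometry-guided multi-view pretraining.

    cc_global, mlo_global, text_global: (B, D) global embeddings of the CC view,
        the MLO view, and the report text.
    cc_local, mlo_local: (B, N, D) patch-level embeddings of the two views.
    correspondence: (B, N) long tensor; correspondence[b, i] is the index of the
        MLO patch geometrically matched to CC patch i (assumed precomputed from
        the multi-view imaging geometry).
    """
    # Vision-vision contrastive loss between paired ipsilateral views.
    loss_vv = info_nce(cc_global, mlo_global)
    # Vision-language contrastive loss between image and report embeddings.
    loss_vl = info_nce(cc_global, text_global)
    # Local alignment: pull each CC patch toward its geometry-matched MLO patch.
    matched = torch.gather(
        mlo_local, 1,
        correspondence.unsqueeze(-1).expand(-1, -1, mlo_local.size(-1)))
    loss_local = 1.0 - F.cosine_similarity(cc_local, matched, dim=-1).mean()
    return w_vv * loss_vv + w_vl * loss_vl + w_local * loss_local
```

In this sketch the global terms align whole-image and report representations, while the local term only has an effect where the assumed correspondence map provides a geometric match between patches of the two views.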