Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms

📅 2026-04-21

🤖 AI Summary
Vision Transformers (ViTs) struggle with fine-grained classification of breast cancer mammograms for two reasons: high-resolution images produce so many tokens that attention disperses away from small lesions, and the classes differ only subtly while lesions within a class vary widely. To address these issues, this work proposes a framework that combines region-of-interest (RoI)-guided token reduction, contrastive learning, and a DINOv2-pretrained ViT with localization-aware features. Specifically, an object detection model selects RoI tokens so that attention concentrates on diagnostically relevant regions, while contrastive learning between the selected RoIs with hard-negative training improves feature discriminability. On public mammography datasets, the proposed method outperforms existing baselines, which the authors present as evidence of its potential for large-scale clinical screening.
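
To make the RoI-guided token reduction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a ViT that splits the image into a row-major grid of square patches and a detector that returns pixel-space boxes; the function name `roi_token_indices`, the box format, and the 1024x1024 image with 16-pixel patches are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels, max edge exclusive.
Box = Tuple[float, float, float, float]


def roi_token_indices(image_hw: Tuple[int, int], patch: int, boxes: List[Box]) -> List[int]:
    """Return sorted row-major indices of patch tokens that overlap any RoI box."""
    height, width = image_hw
    rows, cols = height // patch, width // patch
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Convert pixel extents to patch-grid extents, clamped to the grid.
        c0 = max(0, math.floor(x0 / patch))
        r0 = max(0, math.floor(y0 / patch))
        c1 = min(cols - 1, math.ceil(x1 / patch) - 1)
        r1 = min(rows - 1, math.ceil(y1 / patch) - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                keep.add(r * cols + c)
    return sorted(keep)


if __name__ == "__main__":
    # Hypothetical detector output: one small box in a 1024x1024 image.
    idx = roi_token_indices((1024, 1024), 16, [(100, 200, 180, 260)])
    total = (1024 // 16) ** 2
    print(f"kept {len(idx)} of {total} tokens")
```

In this toy setting the single box covers 30 of the 4096 patch tokens, which illustrates why restricting attention to RoI tokens shrinks the sequence the softmax has to search over.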

📝 Abstract
Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
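
As an illustration of hard-negative based contrastive training between RoI embeddings, here is a minimal sketch. The abstract does not state the exact loss, so this assumes an InfoNCE-style objective with cosine similarity in which only the `k` negatives most similar to the anchor are kept; the function name `hard_negative_info_nce`, the temperature value, and the toy vectors are hypothetical, and embeddings are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hard_negative_info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    k: int = 2,
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss for one anchor, using only the k hardest negatives."""
    pos_logit = cosine(anchor, positive) / temperature
    # Hard-negative mining: keep the k negatives most similar to the anchor.
    hardest = sorted((cosine(anchor, n) for n in negatives), reverse=True)[:k]
    logits = [pos_logit] + [s / temperature for s in hardest]
    # Numerically stable log-sum-exp over the positive and the mined negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - pos_logit


if __name__ == "__main__":
    anchor, positive = [1.0, 0.0], [0.9, 0.1]
    negatives = [[-1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
    print(hard_negative_info_nce(anchor, positive, negatives, k=1))
```

A negative that lies close to the anchor contributes a much larger loss than a distant one, so mining the hardest negatives focuses the gradient on the look-alike cases that make the task fine-grained.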
Problem

Research questions and friction points this paper is trying to address.

breast cancer classification
mammograms
Vision Transformers
fine-grained classification
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
Region of Interest
Contrastive Learning
DINOv2
Mammography
Samyak Sanghvi
Department of Computer Science and Engineering, IIT Delhi, New Delhi, India
Piyush Miglani
Yardi School of AI, IIT Delhi, New Delhi, India
Sarvesh Shashikumar
Department of Computer Science and Engineering, IIT Delhi, New Delhi, India
Kaustubh R Borgavi
Yardi School of AI, IIT Delhi, New Delhi, India
Veenu Singla
Department of Radiodiagnosis, PGIMER Chandigarh, Chandigarh, India
Chetan Arora
Professor, IIT Delhi
Computer Vision, Machine Learning