Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Medical imaging suffers from scarce annotated data, and existing self-supervised masked image modeling (MIM) approaches that rely on random high-ratio masking show poor semantic alignment and weak generalization. To address this, we propose Text-guided Controllable Masking (TCM), a framework that leverages vision-language models to parse diagnostic text prompts and dynamically localize anatomically or pathologically salient regions. TCM performs region-aware masking at a low overall mask ratio (40%) and integrates contrastive learning, avoiding reliance on supervised signals or reconstruction-based heuristics. By unifying prompt-based localization with MIM, TCM enhances representation quality across diverse modalities: on brain MRI, chest CT, and lung X-ray datasets, it improves classification accuracy by up to 3.1 percentage points and boosts detection performance by +1.3 BoxAP and +1.1 MaskAP. These results demonstrate TCM's strong cross-modal and cross-task adaptability and generalization.

📝 Abstract
The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.
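The differentiated masking idea described above can be sketched in a few lines: given a binary region of interest produced by prompt-based localization, hide a high fraction of patches inside the region and a low fraction in the background, so the overall ratio stays well below the typical 70%. This is a minimal NumPy illustration, not the paper's implementation; the function name, the 0.7/0.2 ratios, and the 16-pixel patch size are assumptions chosen only so the combined ratio lands near the reported 40%.

```python
import numpy as np

def differentiated_mask(roi, fg_ratio=0.7, bg_ratio=0.2, patch=16, seed=0):
    """Patch-level mask hiding fg_ratio of ROI patches and bg_ratio of
    background patches. All parameter values here are illustrative."""
    rng = np.random.default_rng(seed)
    h, w = roi.shape
    gh, gw = h // patch, w // patch
    # A patch counts as ROI if any of its pixels lie inside the region.
    grid = roi[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).any(axis=(1, 3))
    mask = np.zeros((gh, gw), dtype=bool)
    for is_roi, ratio in ((True, fg_ratio), (False, bg_ratio)):
        idx = np.flatnonzero(grid == is_roi)          # patches of this type
        n = int(round(ratio * idx.size))              # how many to hide
        mask.flat[rng.choice(idx, size=n, replace=False)] = True
    return mask
```

With half of a 224x224 image marked as ROI, this masks roughly 45% of patches overall, versus 70% for uniform random masking, while still hiding most of the diagnostically relevant region.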
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of annotated medical data via self-supervised learning
Improves semantic alignment by emphasizing diagnostically relevant image regions
Enhances cross-task generalization across diverse medical imaging modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses text-guided masking for medical image analysis
Leverages vision-language models for region localization
Applies differentiated masking to emphasize relevant regions