Boosting Medical Visual Understanding From Multi-Granular Language Learning

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical images often exhibit multi-label annotations (e.g., co-occurring pathologies) and multi-granular textual descriptions (e.g., concise diagnoses and detailed clinical interpretations), yet prevailing vision–language pretraining methods (e.g., CLIP) only model single-label, single-granularity alignment—failing to capture clinical semantic complexity. To address this, we propose Multi-Granular Language Learning (MGLL), the first framework unifying cross-granularity text fusion, structured multi-label supervision, and point-wise soft label learning. MGLL enforces consistency between fine- and coarse-grained text representations via smoothed KL divergence constraints and supports plug-and-play integration. Extensive experiments on multiple medical imaging benchmarks demonstrate substantial improvements over state-of-the-art methods, with consistent gains across downstream tasks—including classification and cross-modal retrieval. The code is publicly available.

📝 Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
Problem

Research questions and friction points this paper is trying to address.

CLIP-style pretraining models only single-label, single-granularity alignment, limiting its use in medical imaging
Medical images carry multiple co-occurring labels, requiring multi-label and cross-granularity alignment
Visual-text alignment must hold across annotation granularities (e.g., concise diagnosis vs. detailed clinical explanation)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-granular contrastive learning for medical imaging
Leverages structured multi-label supervision across granularities
Uses smooth KL divergence for cross-granularity consistency
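The consistency idea above can be sketched in a few lines: treat the image-text similarity scores produced with fine-grained text and with coarse-grained text as two distributions, and penalize their divergence with a symmetric (smoothed) KL term, alongside label smoothing over multi-label targets. This is a minimal numpy sketch under assumptions about the paper's formulation; function names, the smoothing scheme, and the symmetric form are illustrative, not MGLL's exact losses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over similarity scores.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def smooth_labels(labels, eps=0.1):
    """Turn hard multi-label vectors into soft targets:
    spread mass uniformly over the positive classes, then
    blend in eps of uniform mass over all classes.
    (Illustrative smoothing; the paper's scheme may differ.)"""
    labels = labels.astype(float)
    n_classes = labels.shape[-1]
    pos = labels / np.clip(labels.sum(-1, keepdims=True), 1, None)
    return (1 - eps) * pos + eps / n_classes

def kl_div(p, q, eps=1e-8):
    """Row-wise KL(p || q), clipped for numerical safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(-1)

def cross_granularity_consistency(sim_fine, sim_coarse):
    """Symmetric smoothed-KL between the image-text similarity
    distributions induced by fine- and coarse-grained text."""
    p = softmax(sim_fine)
    q = softmax(sim_coarse)
    return 0.5 * (kl_div(p, q) + kl_div(q, p)).mean()
```

For example, `cross_granularity_consistency(s, s)` is zero when both granularities induce the same distribution, and grows as the fine- and coarse-grained views disagree, which is what drives their representations toward consistency during pretraining.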