Multimodal Medical Endoscopic Image Analysis via Progressive Disentangle-aware Contrastive Learning

📅 2025-08-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Accurate segmentation of laryngo-pharyngeal tumors remains challenging because single-modality imaging, particularly white-light imaging (WLI), has limited power to capture complex anatomical and pathological characteristics. To address this, we propose an “Align-Disentangle-Fusion” multimodal learning framework that jointly models paired WLI and narrow-band imaging (NBI) images. Our approach introduces a multi-scale distribution alignment mechanism that aligns features across multiple transformer layers to reduce modality discrepancies; employs progressive feature disentanglement, coupled with disentangle-aware contrastive learning, to explicitly separate modality-specific and shared representations; and uses a Transformer-based architecture for multimodal representation fusion. Evaluated on multiple real-world clinical datasets, our method achieves state-of-the-art performance, improving Dice scores by 3.2–5.8 percentage points over existing methods. It generalizes well, particularly in cases with ambiguous boundaries and low image contrast.
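For readers unfamiliar with the Dice score mentioned above, the following is a minimal sketch of the standard Dice coefficient for binary segmentation masks. The function name `dice_score` and the nested-list mask format are assumptions made for illustration; the paper's own evaluation code is not shown in this summary.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    Dice = 2 * |pred AND target| / (|pred| + |target|), ranging from 0 to 1.
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Two empty masks agree perfectly, so the score is defined as 1.0 here.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks (hypothetical data): 2 overlapping pixels, 3 foreground pixels each.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(predicted, reference)  # 2 * 2 / (3 + 3)
```

On this 0-to-1 scale, one percentage point corresponds to a difference of 0.01.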

📝 Abstract

Accurate segmentation of laryngo-pharyngeal tumors is crucial for precise diagnosis and effective treatment planning. However, traditional single-modality imaging methods often fall short of capturing the complex anatomical and pathological features of these tumors. In this study, we present an innovative multi-modality representation learning framework based on the 'Align-Disentangle-Fusion' mechanism that seamlessly integrates 2D White Light Imaging (WLI) and Narrow Band Imaging (NBI) pairs to enhance segmentation performance. A cornerstone of our approach is multi-scale distribution alignment, which mitigates modality discrepancies by aligning features across multiple transformer layers. Furthermore, a progressive feature disentanglement strategy is developed with the designed preliminary disentanglement and disentangle-aware contrastive learning to effectively separate modality-specific and shared features, enabling robust multimodal contrastive learning and efficient semantic fusion. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art approaches, achieving superior accuracy across diverse real clinical scenarios.
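The abstract does not state which distribution distance is used for multi-scale alignment. As a rough illustration of the idea only, the sketch below assumes a simple moment-matching distance (per-channel mean and variance) between WLI and NBI features, averaged over several layers. All names (`channel_moments`, `moment_gap`, `multi_scale_alignment_loss`) and the choice of distance are hypothetical and not taken from the paper.

```python
def channel_moments(features):
    """Per-channel mean and variance of a batch of feature vectors."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n for d in range(dims)
    ]
    return means, variances


def moment_gap(feats_a, feats_b):
    """Squared gap between first and second moments of two feature batches."""
    mean_a, var_a = channel_moments(feats_a)
    mean_b, var_b = channel_moments(feats_b)
    mean_gap = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    var_gap = sum((a - b) ** 2 for a, b in zip(var_a, var_b))
    return mean_gap + var_gap


def multi_scale_alignment_loss(wli_layers, nbi_layers):
    """Average the per-layer gap over all layers (assumed equal weighting)."""
    gaps = [moment_gap(w, n) for w, n in zip(wli_layers, nbi_layers)]
    return sum(gaps) / len(gaps)


# Hypothetical features: one layer, batch of two samples, two channels.
wli_layers = [[[0.0, 1.0], [2.0, 3.0]]]
nbi_layers = [[[1.0, 2.0], [3.0, 4.0]]]
loss = multi_scale_alignment_loss(wli_layers, nbi_layers)
```

In this toy case the two batches have the same variance but channel means that differ by 1, so the loss comes entirely from the mean term. A real implementation would more likely operate on tensors in a deep learning framework and might use a different distance, such as a kernel-based one.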

Problem

Research questions and friction points this paper is trying to address.

Accurate segmentation of laryngo-pharyngeal tumors for diagnosis

Overcoming limitations of single-modality medical imaging methods

Integrating 2D White Light and Narrow Band Imaging pairs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Align-Disentangle-Fusion mechanism for integrating paired WLI and NBI images

Multi-scale distribution alignment across multiple transformer layers

Progressive feature disentanglement with contrastive learning
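The exact form of the disentangle-aware contrastive objective is not given in this summary. The sketch below shows one common way such an objective could be assembled, under stated assumptions: an InfoNCE-style loss pulls together the shared features of a paired WLI and NBI image, and a separate penalty discourages overlap between the shared and modality-specific features of the same image. The function names, the temperature value, and the cosine-based penalty are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i], not positives[j]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_sum = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


def separation_penalty(shared, specific):
    """Mean squared cosine between shared and modality-specific features."""
    pairs = list(zip(shared, specific))
    return sum(cosine(s, p) ** 2 for s, p in pairs) / len(pairs)


# Hypothetical shared embeddings for two WLI/NBI image pairs.
wli_shared = [[1.0, 0.0], [0.0, 1.0]]
nbi_shared = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(wli_shared, nbi_shared)
mismatched = info_nce(wli_shared, list(reversed(nbi_shared)))
```

With these toy embeddings, the loss is close to zero when each WLI embedding is paired with its own NBI embedding and much larger when the pairs are swapped, which is the behavior a contrastive objective over shared features is meant to reward.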

🔎 Similar Papers

Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis