Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical ultrasound video analysis faces dual challenges: scarcity of annotated data and difficulty in spatiotemporal modeling. To address these, we propose E-ViM³—a visual Mamba architecture preserving the native 3D structure of ultrasound videos, enhanced via self-supervised masked video pretraining to improve data efficiency. Our key contributions are: (1) Enclosure Global Tokens (EGT), a mechanism that strengthens global semantic aggregation by hierarchically integrating contextual information across spatial and temporal dimensions; and (2) Spatial-Temporal Chained (STC) masking, a novel strategy explicitly designed to respect the multi-scale spatiotemporal structure inherent in ultrasound sequences. Evaluated on four benchmarks—EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS—E-ViM³ achieves state-of-the-art performance and remains competitive even with limited labeled data, underscoring its potential for real-world clinical deployment.

📝 Abstract
Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM$^3$, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM$^3$ performs as the state-of-the-art in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.
Problem

Research questions and friction points this paper is trying to address.

Addresses labeled data scarcity in ultrasound video analysis
Enhances 3D video structure modeling for space-time correlations
Improves data efficiency via self-supervised masked video modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Vision Mamba network for ultrasound videos
Enclosure Global Tokens enhance feature aggregation
Spatial-Temporal Chained masking for self-supervised pre-training
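The masked pre-training idea behind these contributions can be illustrated with a minimal sketch. Note that this shows generic VideoMAE-style tube masking, not the paper's Spatial-Temporal Chained (STC) strategy, whose exact chaining rule is not described here; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def tube_mask(num_frames, h_patches, w_patches, mask_ratio=0.75, rng=None):
    """Generic tube masking for masked video modeling (VideoMAE-style).

    The same spatial patches are masked across all frames ("tubes"), which
    forces a model to reconstruct content from spatiotemporal context rather
    than copying pixels from adjacent frames. This is a simplified
    illustration, NOT the paper's STC masking strategy.
    """
    rng = rng or np.random.default_rng(0)
    num_spatial = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_spatial))
    # Choose which spatial patches to hide, without replacement.
    masked_idx = rng.choice(num_spatial, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_spatial, dtype=bool)
    spatial_mask[masked_idx] = True
    # Broadcast the same spatial mask over the temporal axis -> tubes.
    return np.broadcast_to(spatial_mask, (num_frames, num_spatial)).copy()

# Example: 8 frames, a 14x14 patch grid, 75% of patches masked per frame.
mask = tube_mask(num_frames=8, h_patches=14, w_patches=14, mask_ratio=0.75)
```

During pre-training, only the visible (unmasked) patches are encoded, and the decoder reconstructs the masked ones; the high mask ratio is what makes the pretext task non-trivial and data-efficient.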
Jiaheng Zhou
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Biological and Medical Video Analysis
Yanfeng Zhou
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Wei Fang
DAMO Academy, Alibaba Group; Hupan Laboratory, 310023, Hangzhou, China
Yuxing Tang
Alibaba DAMO Academy USA
Computer Vision · Machine Learning · Image Recognition · Deep Learning · Medical Imaging
Le Lu
Ant Group, IEEE Fellow, MICCAI Board Member (2021-2025)
Computer Vision · Medical Image Analysis · Medical Image Computing · Biomedical Image Analysis
Ge Yang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences