🤖 AI Summary
To address the strong noise, low spatiotemporal resolution, and complex cardiac anatomy that degrade segmentation accuracy in echocardiography, this paper proposes the U-shaped Multiscale Visual Mamba network (UMVM). The method introduces a large-window multiscale Mamba module in the decoder, pairs a cascaded residual encoder with dual-attention cross-layer feature fusion, and adds layer-wise auxiliary losses to improve gradient propagation and feature discriminability. UMVM thus combines Mamba's state-space modeling, a U-shaped encoder-decoder architecture, large-window state scanning, multiscale representation learning, and auxiliary supervision. Evaluated on EchoNet-Dynamic and CAMUS, UMVM achieves Dice scores of 95.01% and 93.36% for left ventricular endocardial segmentation and 87.35% and 87.80% for epicardial segmentation across the two datasets, an improvement of 0.54 to 1.11 percentage points over the best-performing competing model.
📝 Abstract
Ultrasound imaging is frequently hampered by high noise levels, low spatiotemporal resolution, and the complexity of anatomical structures. These factors significantly hinder a model's ability to accurately capture and analyze structural relationships and dynamic patterns across the regions of the heart. Mamba, an emerging model, is among the most advanced current approaches and has been widely applied to diverse vision and language tasks. Building on it, this paper introduces a U-shaped deep learning model that incorporates a large-window Mamba scale (LMS) module and a hierarchical feature fusion approach for echocardiographic segmentation. First, a cascaded residual block serves as the encoder and incrementally extracts multiscale detailed features. Second, a large-window multiscale Mamba module is integrated into the decoder to capture global dependencies across regions and to improve the segmentation of complex anatomical structures. Furthermore, the model introduces an auxiliary loss at each decoder layer and employs a dual attention mechanism to fuse multilayer features both spatially and across channels, which improves accuracy in delineating complex anatomical structures. Finally, experimental results on the EchoNet-Dynamic and CAMUS datasets demonstrate that the model outperforms other methods in both accuracy and robustness. For segmentation of the left ventricular endocardium (${LV}_{endo}$), the model achieved best values of 95.01 and 93.36, respectively, while for the left ventricular epicardium (${LV}_{epi}$), it achieved 87.35 and 87.80, respectively. This represents an improvement of between 0.54 and 1.11 over the best-performing competing model.
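The idea of attaching an auxiliary loss to each decoder layer can be illustrated with a minimal sketch. The abstract does not state which loss function, layer weights, or resizing scheme the authors use, so everything below is an assumption made for illustration: a soft Dice loss, a hypothetical `weights` list, and nearest-neighbour downsampling of the ground-truth mask by a factor of 2 per decoder level. The function names `soft_dice_loss`, `downsample`, and `deep_supervision_loss` are likewise hypothetical.

```python
from typing import List, Sequence

Mask = List[List[float]]  # 2D map of probabilities (prediction) or 0/1 labels (target)


def soft_dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Return 1 - Dice overlap between a probability map and a binary mask."""
    inter = sum(p * t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    total = sum(p for row in pred for p in row) + sum(t for row in target for t in row)
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def downsample(mask: Mask, factor: int) -> Mask:
    """Nearest-neighbour downsampling: keep every `factor`-th row and column."""
    return [row[::factor] for row in mask[::factor]]


def deep_supervision_loss(preds: Sequence[Mask], target: Mask,
                          weights: Sequence[float]) -> float:
    """Weighted sum of per-layer Dice losses.

    Assumption: preds[0] is the full-resolution output and preds[i] is
    2**i times smaller along each axis (one prediction per decoder layer).
    """
    if len(preds) != len(weights):
        raise ValueError("need exactly one weight per decoder output")
    total = 0.0
    for level, (pred, weight) in enumerate(zip(preds, weights)):
        # Compare each decoder output with the mask resized to its resolution.
        total += weight * soft_dice_loss(pred, downsample(target, 2 ** level))
    return total


# Toy 4x4 mask with a 2x2 foreground block, plus a half-resolution output.
target = [[1.0, 1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
perfect = [target, downsample(target, 2)]
empty = [[[0.0] * 4 for _ in range(4)], [[0.0] * 2 for _ in range(2)]]

loss_perfect = deep_supervision_loss(perfect, target, weights=[1.0, 0.5])
loss_empty = deep_supervision_loss(empty, target, weights=[1.0, 0.5])
```

In this sketch, predictions that match the mask at every level give a loss near zero, while empty predictions are penalized at every level, so each decoder layer receives its own supervisory signal rather than relying only on gradients from the final output.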