🤖 AI Summary
To address the insufficient exploitation of inter-frame motion information in echocardiographic segmentation, this paper proposes a lightweight, plug-and-play multi-head KQV cross-temporal attention module (TAM). TAM explicitly models cross-frame feature interactions to fuse temporal motion cues efficiently. It is the first module to support repeated, iterative temporal feature extraction within a network, and it is compatible with both CNN- and Transformer-based backbones (e.g., UNet, SwinUNetR, I2UNet) for 2D and 3D ultrasound data. Evaluated on the CAMUS (2D) and MITEA (3D) benchmarks, TAM consistently improves segmentation accuracy across diverse architectures, and the TAM-enhanced FCN8s performs favorably against contemporary methods. These results demonstrate robustness, strong generalization, and architecture-agnostic plug-and-play capability with little additional computational overhead.
📝 Abstract
Cardiac anatomy segmentation is essential for the clinical assessment of cardiac function and for disease diagnosis to inform treatment and intervention. Deep learning (DL) algorithms have significantly improved segmentation accuracy over traditional image-processing approaches, and recent studies show that enriching DL segmentation with motion information can improve it further. However, many proposed methods for injecting motion information either increase the dimensionality of the input images, which is computationally expensive, or insert the motion information suboptimally, e.g., via non-DL registration, non-attention-based networks, or single-headed attention. Here, we present a computation-efficient alternative: a novel, scalable temporal attention module (TAM) with a multi-headed, KQV-projection cross-attention architecture that extracts temporal feature interactions repeatedly throughout a network. The module can be seamlessly integrated into a wide range of existing CNN- or Transformer-based networks, offering new flexibility for inclusion in future implementations. Extensive evaluations on distinct cardiac datasets, 2D echocardiography (CAMUS) and 3D echocardiography (MITEA), demonstrate the module's effectiveness when integrated into well-established backbone networks such as UNet, FCN8s, UNetR, SwinUNetR, and the recent I2UNet. We further find that the optimized TAM-enhanced FCN8s network performs well compared with contemporary alternatives. Our results confirm TAM's robustness, scalability, and generalizability across diverse datasets and backbones.
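The paper itself does not include code here; as a rough illustration of the core idea, a multi-head KQV cross-attention between the feature maps of two frames (queries from the current frame, keys/values from a neighboring frame, with a residual fusion) might look like the following NumPy sketch. All function names, weight shapes, and the residual-add fusion are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_cross_frame_attention(feat_t, feat_prev, Wq, Wk, Wv, Wo, num_heads):
    """Cross-temporal attention sketch (hypothetical, not the paper's code).

    feat_t, feat_prev: (N, d) flattened feature maps of the current and a
    neighboring frame; Wq, Wk, Wv, Wo: (d, d) learned projections.
    Queries come from the current frame; keys/values from the other frame.
    """
    N, d = feat_t.shape
    dh = d // num_heads  # per-head channel dimension

    # KQV projections, then split channels into heads: (heads, N, dh).
    Q = (feat_t @ Wq).reshape(N, num_heads, dh).transpose(1, 0, 2)
    K = (feat_prev @ Wk).reshape(N, num_heads, dh).transpose(1, 0, 2)
    V = (feat_prev @ Wv).reshape(N, num_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention per head: (heads, N, N).
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))

    # Merge heads back to (N, d), project, and fuse with a residual add.
    out = (attn @ V).transpose(1, 0, 2).reshape(N, d)
    return feat_t + out @ Wo
```

Because the output keeps the input's shape and the motion cues enter through a residual connection, a block like this can be dropped between existing encoder or decoder stages of a CNN or Transformer backbone without altering the surrounding architecture, which is the plug-and-play property the abstract emphasizes.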