🤖 AI Summary
To address the high memory consumption and weak local detail modeling of Transformers in 3D medical image segmentation, this paper proposes WaveFormer—a wavelet-driven frequency-domain feature representation framework. Inspired by the top-down recognition mechanism of the human visual system, WaveFormer employs the discrete wavelet transform (DWT) for multi-scale frequency-domain encoding, introduces a wavelet-based feature summarization and reconstruction module in place of heavy upsampling layers, and designs a lightweight 3D self-attention mechanism to jointly capture global context and fine-grained local structure. The architecture is biologically motivated and significantly reduces both parameter count and GPU memory footprint. Evaluated on three major benchmarks—BraTS2023, FLARE2021, and KiTS2023—WaveFormer performs on par with state-of-the-art methods while offering substantially lower computational complexity, making it an efficient and interpretable approach to 3D medical image segmentation.
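To make the multi-scale frequency-domain encoding concrete, here is a minimal NumPy sketch of a single-level 3D DWT. It assumes an orthonormal Haar basis (the summary does not specify which wavelet the paper uses); one level splits a volume into eight sub-bands and halves each spatial dimension, which is why the low-frequency band can act as a compact global summary of the volume.

```python
import numpy as np

def haar_dwt_1d(x, axis):
    """One-level orthonormal Haar split along one axis:
    average (low-pass) and difference (high-pass) of sample pairs."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2)   # approximation coefficients
    high = (even - odd) / np.sqrt(2)  # detail coefficients
    return low, high

def haar_dwt_3d(volume):
    """Apply the 1D split along each of the three axes in turn,
    producing 8 sub-bands keyed 'aaa' ... 'ddd' ('a' = low, 'd' = high)."""
    bands = {'': volume}
    for axis in range(3):
        bands = {key + tag: coeff
                 for key, v in bands.items()
                 for tag, coeff in zip('ad', haar_dwt_1d(v, axis))}
    return bands

# Hypothetical toy volume standing in for a 3D feature map:
x = np.random.rand(8, 8, 8)
bands = haar_dwt_3d(x)
# Each sub-band has half the resolution per axis; the transform is
# orthonormal, so the total energy of the sub-bands equals that of x.
```

The energy-preservation property is what makes the low-frequency band a lossless-in-aggregate summary rather than a blurred downsample: the discarded resolution is retained in the detail bands instead of being thrown away.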
📝 Abstract
Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transformations (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
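Why wavelet reconstruction can replace learned upsampling: the DWT is exactly invertible, so a decoder can recover full resolution from the sub-bands with a fixed, parameter-free inverse transform. The self-contained NumPy sketch below demonstrates this round trip, again assuming an orthonormal Haar basis (an assumption; the abstract does not name the wavelet).

```python
import numpy as np

def haar_fwd(x, axis):
    # Forward Haar step: pairwise average (low) and difference (high).
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_inv(low, high, axis):
    # Inverse Haar step: re-interleave the recovered even/odd samples.
    shape = list(low.shape)
    shape[axis] *= 2
    out = np.empty(shape)
    sl = [slice(None)] * low.ndim
    sl[axis] = slice(0, None, 2)
    out[tuple(sl)] = (low + high) / np.sqrt(2)
    sl[axis] = slice(1, None, 2)
    out[tuple(sl)] = (low - high) / np.sqrt(2)
    return out

def dwt3(vol):
    # One-level 3D decomposition into 8 sub-bands ('a' = low, 'd' = high).
    bands = {'': vol}
    for ax in range(3):
        bands = {k + tag: c for k, v in bands.items()
                 for tag, c in zip('ad', haar_fwd(v, ax))}
    return bands

def idwt3(bands):
    # Merge sub-band pairs axis by axis, in reverse order of decomposition.
    for ax in (2, 1, 0):
        prefixes = {k[:-1] for k in bands}
        bands = {p: haar_inv(bands[p + 'a'], bands[p + 'd'], ax)
                 for p in prefixes}
    return bands['']

# Hypothetical toy volume; idwt3(dwt3(x)) recovers x to machine precision.
x = np.random.rand(8, 8, 8)
reconstructed = idwt3(dwt3(x))
```

Because reconstruction here is a fixed linear map rather than a stack of transposed convolutions, it contributes no parameters and little memory, which is consistent with the abstract's claim that wavelet-based reconstruction is what cuts the decoder's cost.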