🤖 AI Summary
Addressing the dual challenges of modeling long-range dependencies and low computational efficiency in 3D medical image segmentation, this paper proposes the first multimodal framework integrating Mamba-based sequential modeling with the Kolmogorov–Arnold Network (KAN). Our key contributions are: (1) a novel 3D Grouped Rational KAN (3D-GR-KAN) module—the first application of grouped rational KANs to 3D volumetric data—achieving superior expressivity with parameter efficiency; (2) an Enhanced Gated Spatial Convolution (EGSC) operator that strengthens local-global spatial awareness; and (3) a dual-path CLIP-guided text-driven mechanism enabling semantic consistency and lesion-level fine-grained alignment. Evaluated on the MSD and KiTS23 benchmarks, our method achieves state-of-the-art performance—improving Dice score by +3.2% and inference speed by ×2.1 (FPS)—while supporting lightweight clinical deployment. The code is publicly available.
📝 Abstract
3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Networks variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at https://github.com/yhy-whu/TK-Mamba.