🤖 AI Summary
Existing multi-view medical image classification methods often neglect cross-view correlations and suffer from limited receptive fields (in CNNs) or quadratic computational complexity (in Transformers). To address these limitations, we propose the first pure-Mamba architecture for two-stage cross-view fusion. In Stage I, Mamba models long-range spatial dependencies within each individual view; in Stage II, a state-space model explicitly captures discriminative cross-view discrepancy features. Our approach entirely replaces convolutional and self-attention mechanisms with selective scanning and hardware-aware parallelization. Evaluated on the MURA, CheXpert, and DDSM benchmarks, the method consistently outperforms state-of-the-art CNN- and Transformer-based multi-view models, achieving substantial gains in classification accuracy while remaining computationally efficient.
📝 Abstract
Compared to single-view medical image classification, using multiple views can significantly enhance predictive accuracy, since it accounts for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from limited receptive fields (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture for multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy that facilitates the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enabling seamless information transfer between views. Results on three public datasets, MURA, CheXpert, and DDSM, demonstrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.
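The two-stage fusion idea from the abstract can be sketched with a toy linear state-space scan standing in for Mamba's selective scan. This is a minimal illustration only: the real XFMamba uses learned, input-dependent SSM parameters, 2-D scanning orders, and hardware-aware kernels, and all function names and the cross-view difference term below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssm_scan(x, A=0.9, B=1.0, C=1.0):
    """Toy linear state-space scan over a (T, D) token sequence:
    h_t = A*h_{t-1} + B*x_t ; y_t = C*h_t (scalar parameters for brevity).
    Stands in for Mamba's selective scan, whose A, B, C are learned
    and input-dependent."""
    h = np.zeros(x.shape[1])
    ys = []
    for t in range(x.shape[0]):
        h = A * h + B * x[t]   # recurrent state carries long-range context
        ys.append(C * h)
    return np.stack(ys)

def two_stage_fusion(view_a, view_b):
    """Two-stage cross-view fusion on two (T, D) patch-token sequences."""
    # Stage I: model long-range dependencies within each view separately.
    fa = ssm_scan(view_a)
    fb = ssm_scan(view_b)
    # Stage II: scan a joint sequence that includes an explicit cross-view
    # difference term, a stand-in for learning cross-view disparity features.
    joint = np.concatenate([fa, fb, np.abs(fa - fb)], axis=0)
    fused = ssm_scan(joint)
    return fused.mean(axis=0)  # pooled feature for a classification head
```

For example, feeding two views of 16 patch tokens with 8 feature dimensions each yields a single 8-dimensional fused representation, which a linear classifier would then map to class logits.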