🤖 AI Summary
This work proposes SVGFormer, a decoder-free dual-scale graph neural network that addresses the limitations of conventional 3D medical image models, which rely on parameter-heavy encoder-decoder architectures and expend substantial computational resources on spatial reconstruction, thereby compromising both brain tumor localization accuracy and model interpretability. SVGFormer constructs a semantic graph via content-aware supervoxel grouping and integrates a patch-level Vision Transformer with a supervoxel-level graph attention network to jointly model local details and inter-regional dependencies, dedicating all model capacity to feature learning. The method achieves, for the first time, intrinsic interpretability across both voxel and regional scales. On the BraTS dataset, it attains a node classification F1-score of 0.875 and a tumor proportion regression MAE of 0.028, significantly outperforming existing approaches.
📝 Abstract
Modern vision backbones for 3D medical imaging typically process dense voxel grids through parameter-heavy encoder-decoder structures, a design that allocates a significant portion of its parameters to spatial reconstruction rather than feature learning. Our approach introduces SVGFormer, a decoder-free pipeline built upon a content-aware grouping stage that partitions the volume into a semantic graph of supervoxels. Its hierarchical encoder learns rich node representations by combining a patch-level Transformer with a supervoxel-level Graph Attention Network, jointly modeling fine-grained intra-region features and broader inter-regional dependencies. This design concentrates all learnable capacity on feature encoding and provides inherent, dual-scale explainability from the patch to the region level. To validate the framework's flexibility, we trained two specialized models on the BraTS dataset: one for node-level classification and one for tumor proportion regression. Both models achieved strong performance, with the classification model achieving a F1-score of 0.875 and the regression model a MAE of 0.028, confirming the encoder's ability to learn discriminative and localized features. Our results establish that a graph-based, encoder-only paradigm offers an accurate and inherently interpretable alternative for 3D medical image representation.