🤖 AI Summary
Medical image segmentation suffers from severe annotation scarcity. Existing Vision Transformer (ViT)-based masked image modeling (MIM) pre-training methods reconstruct masked patches only from the encoder's final-layer features, neglecting the fine-grained semantic information encoded in intermediate layers. To address this, we propose Hierarchical Encoder-driven MAE (Hi-End-MAE), an MIM framework that introduces an encoder-driven reconstruction mechanism and a hierarchical dense decoding architecture, explicitly modeling and fusing multi-level ViT features to enhance representation discriminability. Pre-trained on 10K unlabeled CT volumes, Hi-End-MAE achieves state-of-the-art performance across seven medical image segmentation benchmarks, improving the average Dice score by 2.1–4.8% over prior methods. Comprehensive ablation studies and cross-dataset evaluations validate its generalizability and segmentation accuracy. The source code is publicly available.
📝 Abstract
Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training Vision Transformers (ViTs) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize locally aggregated representations in the output layer and fail to exploit the rich representations across different ViT layers that better capture the fine-grained semantic information needed for more precise medical downstream tasks. To fill this gap, we present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) encoder-driven reconstruction, which encourages the encoder to learn more informative features that guide the reconstruction of masked patches; and (2) hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
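To make the two innovations concrete, here is a minimal PyTorch sketch of how hierarchical dense decoding could be structured: learnable masked-token queries cross-attend to visible-patch features tapped from several ViT encoder layers (shallow to deep), so reconstruction is driven by multi-level encoder representations rather than the final layer alone. The class name `HierarchicalDenseDecoder`, the embedding width, head count, number of tapped levels, and the 16³ patch size are all illustrative assumptions, not the official Hi-End-MAE implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn


class HierarchicalDenseDecoder(nn.Module):
    """Illustrative sketch (not the official implementation): masked-token
    queries read from encoder features collected at several ViT layers,
    so the encoder drives the reconstruction of masked patches."""

    def __init__(self, dim=768, num_heads=12, num_levels=4, patch_voxels=16 ** 3):
        super().__init__()
        # One cross-attention block per tapped encoder level (assumed count).
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_levels))
        self.head = nn.Linear(dim, patch_voxels)  # predict raw voxels per masked patch

    def forward(self, mask_queries, encoder_feats):
        # mask_queries:  (B, N_masked, dim) learnable tokens at masked positions
        # encoder_feats: list of (B, N_visible, dim) tensors, shallow -> deep
        x = mask_queries
        for attn, norm, feats in zip(self.blocks, self.norms, encoder_feats):
            # Encoder-driven reconstruction: encoder features serve as
            # keys/values; masked queries only read from them.
            out, _ = attn(query=norm(x), key=feats, value=feats)
            x = x + out  # residual update at each hierarchy level
        return self.head(x)  # (B, N_masked, patch_voxels)


# Toy usage: 4 tapped levels, 49 visible tokens, 147 masked tokens.
decoder = HierarchicalDenseDecoder()
feats = [torch.randn(2, 49, 768) for _ in range(4)]
queries = torch.randn(2, 147, 768)
print(decoder(queries, feats).shape)  # torch.Size([2, 147, 4096])
```

Routing each decoding stage to a different encoder depth is what distinguishes this design from a vanilla MAE decoder, which attends only to the final-layer output; the per-level residual updates let shallow, texture-level features and deep, semantic features both contribute to the reconstruction target.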