Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Medical image segmentation suffers from severe annotation scarcity. Existing Vision Transformer (ViT)-based masked image modeling (MIM) pretraining methods reconstruct only from the final-layer local aggregated features of the encoder, neglecting fine-grained semantic information encoded across intermediate layers. To address this, we propose Hierarchical Encoder-driven Masked Autoencoder (HE-MaskAE), the first MIM framework that introduces an encoder-driven reconstruction mechanism and a dense cross-layer decoding architecture. HE-MaskAE explicitly models and fuses multi-level ViT features to enhance representation discriminability. Pretrained on 10K unlabeled CT volumes, HE-MaskAE achieves state-of-the-art performance across seven medical segmentation benchmarks, improving average Dice score by 2.1–4.8% over prior methods. Comprehensive ablation studies and cross-dataset evaluations validate its superior generalizability and segmentation accuracy. The source code is publicly available.

Technology Category

Application Category

📝 Abstract
Medical image segmentation remains a formidable challenge due to the label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluated its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
Problem

Research questions and friction points this paper is trying to address.

Addresses label scarcity in medical image segmentation
Enhances representation learning across Vision Transformer layers
Improves precision in medical downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Encoder-driven MAE
Encoder-driven reconstruction
Hierarchical dense decoding
🔎 Similar Papers
No similar papers found.
Fenghe Tang
Fenghe Tang
University of Science and Technology of China
Medical Image AnalysisFoundation model
Qingsong Yao
Qingsong Yao
Stanford University | ICT, CAS
Medical Image ComputingMedical Image Analysis
Wenxin Ma
Wenxin Ma
University of Science and Technology of China
AIcomputer vision
Chenxu Wu
Chenxu Wu
USTC
diffusion-based methods,multimodal learning
Zihang Jiang
Zihang Jiang
School of Biomedical Engineering, USTC, Suzhou Institute for Advanced Research
Computer VisionMedical Imaging3D
S
S. Kevin Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui, 230026, P.R. China; Center for Medical Imaging, Robotics, and Analytic Computing &LEarning (MIRACLE), Suzhou Institute for Advanced Research, USTC, Suzhou 215123, China; State Key Laboratory of Precision and Intelligent Chemistry, USTC, Hefei, Anhui 230026, China; Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of C