MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

📅 2024-04-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing 3D medical image Masked Autoencoders (MAEs) suffer from insufficient discriminative pre-trained representations due to the lack of hierarchical modeling. To address this, we propose a multi-granularity “Mask-in-Mask” framework. Our method introduces a novel cross-scale masked reconstruction mechanism, integrates anatomical structure-guided cross-level feature alignment, and employs a hybrid backbone network to enable efficient hierarchical representation learning. Evaluated on 13 public 3D medical imaging datasets, our approach consistently outperforms state-of-the-art self-supervised methods across organ/lesion/tumor segmentation and disease classification tasks, achieving new SOTA performance. Furthermore, pre-training on a large-scale cohort of 10,000 CT scans empirically validates the critical role of data scale in building robust medical foundation models.

📝 Abstract
The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial sizes and higher dimensionality of 3D medical images, the lack of a hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to volumes at adjacent levels to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enable efficient hierarchical representation learning during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale MiM up to pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance downstream performance. These gains suggest that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
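The core sampling idea described above — masking coarse patches of the volume, then masking finer sub-patches nested within the still-visible regions — can be sketched in a few lines. This is a minimal illustrative sketch with NumPy, not the paper's implementation; all names, patch sizes, and masking ratios here are assumptions chosen for readability.

```python
import numpy as np

def hierarchical_masks(vol_shape=(64, 64, 64), coarse_patch=16, fine_patch=8,
                       coarse_ratio=0.6, fine_ratio=0.5, seed=0):
    """Illustrative 'mask-in-mask' sampling: first hide a fraction of coarse
    patches, then additionally hide finer sub-patches inside the visible
    coarse patches. Ratios and patch sizes are hypothetical defaults."""
    rng = np.random.default_rng(seed)

    # Coarse level: one boolean flag per coarse patch (True = masked).
    cgrid = tuple(s // coarse_patch for s in vol_shape)  # e.g. (4, 4, 4)
    n_coarse = int(np.prod(cgrid))
    coarse_mask = np.zeros(n_coarse, dtype=bool)
    masked_idx = rng.choice(n_coarse, int(coarse_ratio * n_coarse), replace=False)
    coarse_mask[masked_idx] = True
    coarse_mask = coarse_mask.reshape(cgrid)

    # Fine level: subdivide each coarse patch into sub-patches, inherit the
    # coarse mask, then mask an extra fraction of the still-visible cells.
    sub = coarse_patch // fine_patch  # sub-patches per axis, e.g. 2
    fine_mask = np.repeat(
        np.repeat(np.repeat(coarse_mask, sub, 0), sub, 1), sub, 2).copy()
    visible = np.argwhere(~fine_mask)
    extra = rng.choice(len(visible), int(fine_ratio * len(visible)), replace=False)
    fine_mask[tuple(visible[extra].T)] = True
    return coarse_mask, fine_mask
```

Each level's masked positions would then feed a reconstruction target at that granularity, with a cross-level alignment term relating the adjacent levels; that loss machinery is beyond this sketch.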
Problem

Research questions and friction points this paper is trying to address.

3D Medical Image Analysis
Masked Autoencoder (MAE)
Pre-training Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-in-Mask (MiM) Pre-training
3D Medical Image Analysis
Hybrid Backbone Network
Jiaxin Zhuang
PhD in CSE, HKUST
Computer VisionMedical Image AnalysisArtificial Intelligence
Linshan Wu
Departments of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
Qiong Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
V. Vardhanabhuti
Department of Diagnostic Radiology, The University of Hong Kong, Hong Kong SAR
Lin Luo
College of Engineering, Peking University, Beijing, China
Hao Chen
Department of Computer Science and Engineering, Department of Chemical and Biological Engineering, and State Key Laboratory of Molecular Neuroscience, Hong Kong University of Science and Technology, Hong Kong, and HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China