Masked Image Modeling: A Survey

📅 2024-08-13
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing research on masked image modeling (MIM) for self-supervised visual representation learning lacks a unified formalization of pretraining paradigms and standardized, comparable evaluation. Method: We formally categorize MIM into two principal paradigms—reconstruction-based and contrastive-based—and construct an interpretable, hierarchical taxonomy via expert curation and agglomerative clustering. We conduct systematic, unified benchmarking of over 20 state-of-the-art models on ImageNet and other standard datasets. Contribution/Results: Our analysis identifies critical open challenges—including cross-paradigm integration, long-tailed masking strategies, and compute-accuracy trade-offs. To foster reproducibility and standardization, we publicly release a structured literature repository and an extensible evaluation framework on GitHub. This work establishes foundational infrastructure for rigorous, comparable advancement in MIM research.

📝 Abstract
In this work, we survey recent studies on masked image modeling (MIM), an approach that emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g. pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information by using the context available in the visible part of the input. We identify and formalize two categories of approaches to implementing MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers in recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm. We further identify relevant clusters by manually inspecting the resulting dendrogram. Our review also includes datasets that are commonly used in MIM research. We aggregate the performance results of various masked image modeling methods on the most popular datasets, to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several interesting directions of future work. We supplement our survey with the following public repository containing organized references: https://github.com/vladhondru25/MIM-Survey.
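The reconstruction-based pretext task described in the abstract can be illustrated with a minimal sketch. This is not the survey's code or any specific method's implementation: the function names (`patchify`, `random_mask`, `masked_mse`), the 4x4 patch size, the 50% mask ratio, and the mean-value stand-in for a trained autoencoder are all assumptions chosen for illustration. The sketch also assumes the image height and width are divisible by the patch size.

```python
import random

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.
    Assumes H and W are divisible by `patch`."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices to hide from the model (uniform random masking)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))

def masked_mse(predicted, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for k in masked_idx:
        for p, t in zip(predicted[k], target[k]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]  # toy 8x8 "image"
patches = patchify(image, patch=4)                  # four patches of 16 pixels each
masked_idx = random_mask(len(patches), 0.5, rng)    # hide half of the patches

# Hypothetical stand-in for a trained autoencoder: predict the mean of the
# visible pixels for every pixel of every patch.
visible = [v for k, p in enumerate(patches) if k not in masked_idx for v in p]
mean_visible = sum(visible) / len(visible)
predicted = [[mean_visible] * len(p) for p in patches]

loss = masked_mse(predicted, patches, masked_idx)
print(f"masked patches: {masked_idx}, reconstruction loss: {loss:.2f}")
```

In an actual MIM setup, the stand-in predictor would be replaced by a learned model that sees only the visible patches, and the loss on the masked patches would drive training.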
Problem

Research questions and friction points this paper is trying to address.

Masked Image Modeling
Computer Vision
Automatic Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Image Modeling
Hierarchical Clustering Algorithm
Performance Benchmarking
🔎 Similar Papers
Vlad Hondru
PhD Student, University of Bucharest, Romania; Machine Learning Engineer, eMAG
Machine Learning, Computer Vision, NLP, Diffusion Models
Florinel Alin Croitoru
Department of Computer Science, University of Bucharest, 14 Academiei, Bucharest, 010014, Romania
Shervin Minaee
Applied AI Team, Amazon, Seattle, USA
Radu Tudor Ionescu
Professor, University of Bucharest, Romania
Computer Vision, Machine Learning, AI, Computational Linguistics, Medical Imaging
N. Sebe
Department of Information Engineering and Computer Science, University of Trento, 9 via Sommarive, Povo-Trento, 38123, Italy