Rethinking Patch Dependence for Masked Autoencoders

📅 2024-01-25
🏛️ arXiv.org
📈 Citations: 13
Influential: 4
🤖 AI Summary
This work investigates the roles of masked-patch self-attention and masked-to-visible cross-attention in the MAE decoder for representation learning, revealing that image reconstruction primarily relies on global semantic representations extracted by the encoder, not on intra-masked-patch interactions within the decoder. Motivated by this finding, the authors propose CrossMAE: a streamlined framework that retains only the cross-attention mechanism and entirely removes self-attention among masked tokens in the decoder. This design provides the first empirical demonstration that MAE’s effectiveness stems from the encoder’s strong global modeling capacity, challenging the prevailing assumption that decoder-side modeling of dependencies among masked patches is essential. Evaluated across ViT-S to ViT-H architectures, CrossMAE matches or surpasses standard MAE while reducing GPU memory consumption by 37% and FLOPs by 42%. Code and pretrained models are publicly available.
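The key mechanism the summary describes, each masked patch being "read out" from encoder outputs via cross-attention alone, can be sketched minimally as below. This is an illustrative single-head, numpy-only version with hypothetical names and shapes, not the paper's multi-head PyTorch implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_decode(mask_queries, encoder_tokens, d):
    """Each masked-patch query attends only to the visible encoder tokens.
    With no self-attention among mask queries, every reconstruction is
    computed independently of the other masked patches."""
    # mask_queries: (num_masked, d); encoder_tokens: (num_visible, d)
    scores = mask_queries @ encoder_tokens.T / np.sqrt(d)  # (num_masked, num_visible)
    weights = softmax(scores, axis=-1)
    return weights @ encoder_tokens                        # (num_masked, d)

rng = np.random.default_rng(0)
d = 64
visible = rng.standard_normal((49, d))  # encoder outputs for visible patches
queries = rng.standard_normal((8, d))   # queries for a small subset of masked patches
recon = cross_attention_decode(queries, visible, d)
print(recon.shape)  # (8, 64)
```

Because there is no attention among the mask queries, decoding the queries one at a time yields exactly the same result as decoding them in a batch, which is the independence property CrossMAE exploits to reconstruct only a small subset of masked patches.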

📝 Abstract
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
Problem

Research questions and friction points this paper is trying to address.

Does MAE reconstruction depend on interactions among masked patches in the decoder, or on the global representation learned by the encoder?
Can a decoder that uses only cross-attention match full self-attention while cutting computation?
Is interaction between mask tokens actually necessary for effective masked pretraining?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention only in MAE decoder
Global representation learning in encoder
Reduced computation with comparable performance
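The computational saving behind the last bullet comes from shrinking the attention score matrix: a standard MAE decoder runs self-attention over all N patch tokens, while CrossMAE attends from only k masked-patch queries to the v visible tokens. A back-of-envelope sketch with illustrative numbers (the masking and prediction ratios here are assumptions, not the paper's reported measurements):

```python
# Attention score cost scales with (number of queries) x (number of keys).
N = 196                          # 14x14 patches for a 224x224 ViT
mask_ratio = 0.75                # typical MAE masking ratio
v = int(N * (1 - mask_ratio))    # 49 visible tokens seen by the encoder
k = int(N * mask_ratio * 0.25)   # decode only a 25% subset of masked patches (assumed)

full_pairs = N * N               # query-key pairs in full decoder self-attention
cross_pairs = k * v              # pairs in the cross-attention readout
print(full_pairs, cross_pairs, cross_pairs / full_pairs)
```

Under these assumed ratios the cross-attention readout computes roughly 5% of the query-key pairs of full self-attention, which is the intuition behind the reported memory and FLOPs reductions.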