Rethinking Patch Dependence for Masked Autoencoders

📅 2024-01-25
🏛️ arXiv.org
📈 Citations: 13
Influential: 4
🤖 AI Summary
This work investigates the roles of masked-patch self-attention and masked-to-visible cross-attention in the MAE decoder for representation learning, revealing that image reconstruction primarily relies on global semantic representations extracted by the encoder, not on intra-masked-patch interactions within the decoder. Motivated by this finding, the authors propose CrossMAE: a streamlined framework that retains only the cross-attention mechanism and entirely removes self-attention among masked tokens in the decoder. This design provides the first empirical demonstration that MAE’s effectiveness stems from the encoder’s strong global modeling capacity, challenging the prevailing assumption that decoder-side modeling of dependencies among masked patches is essential. Evaluated across ViT-S to ViT-H architectures, CrossMAE matches or surpasses standard MAE while reducing GPU memory consumption by 37% and FLOPs by 42%. Code and pretrained models are publicly available.
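The key mechanism the summary describes, each masked patch being "read out" from encoder outputs via cross-attention alone, can be sketched minimally as below. This is an illustrative single-head, numpy-only version with hypothetical names and shapes, not the paper's multi-head PyTorch implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_decode(mask_queries, encoder_tokens, d):
    """Each masked-patch query attends only to the visible encoder tokens.
    With no self-attention among mask queries, every reconstruction is
    computed independently of the other masked patches."""
    # mask_queries: (num_masked, d); encoder_tokens: (num_visible, d)
    scores = mask_queries @ encoder_tokens.T / np.sqrt(d)  # (num_masked, num_visible)
    weights = softmax(scores, axis=-1)
    return weights @ encoder_tokens                        # (num_masked, d)

rng = np.random.default_rng(0)
d = 64
visible = rng.standard_normal((49, d))  # encoder outputs for visible patches
queries = rng.standard_normal((8, d))   # queries for a small subset of masked patches
recon = cross_attention_decode(queries, visible, d)
print(recon.shape)  # (8, 64)
```

Because there is no attention among the mask queries, decoding the queries one at a time yields exactly the same result as decoding them in a batch, which is the independence property CrossMAE exploits to reconstruct only a small subset of masked patches.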

📝 Abstract
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
Problem

Research questions and friction points this paper is trying to address.

Does MAE reconstruction depend on interactions among masked patches in the decoder, or on the global representation learned by the encoder?
Can a decoder that uses only cross-attention match full self-attention while cutting computation?
Is interaction between mask tokens actually necessary for effective masked pretraining?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention only in MAE decoder
Global representation learning in encoder
Reduced computation with comparable performance
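The computational saving behind the last bullet comes from shrinking the attention score matrix: a standard MAE decoder runs self-attention over all N patch tokens, while CrossMAE attends from only k masked-patch queries to the v visible tokens. A back-of-envelope sketch with illustrative numbers (the masking and prediction ratios here are assumptions, not the paper's reported measurements):

```python
# Attention score cost scales with (number of queries) x (number of keys).
N = 196                          # 14x14 patches for a 224x224 ViT
mask_ratio = 0.75                # typical MAE masking ratio
v = int(N * (1 - mask_ratio))    # 49 visible tokens seen by the encoder
k = int(N * mask_ratio * 0.25)   # decode only a 25% subset of masked patches (assumed)

full_pairs = N * N               # query-key pairs in full decoder self-attention
cross_pairs = k * v              # pairs in the cross-attention readout
print(full_pairs, cross_pairs, cross_pairs / full_pairs)
```

Under these assumed ratios the cross-attention readout computes roughly 5% of the query-key pairs of full self-attention, which is the intuition behind the reported memory and FLOPs reductions.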