🤖 AI Summary
Existing Masked Autoencoders (MAEs) employ random masking, disregarding the variability of information content across patches and the requirements of downstream tasks, which limits representation discriminability and generalization. To address this, we propose an end-to-end differentiable, downstream-aware mask learning framework that, for the first time, backpropagates downstream-task gradients into the mask selection process of MAE pretraining. Our method jointly optimizes task-oriented dynamic masking policies across multiple levels, enabling gradient-driven mask scheduling without additional annotations, and supports plug-and-play integration of arbitrary downstream-task feedback signals. Extensive experiments demonstrate consistent and significant improvements over MAE and other baselines across diverse vision benchmarks, including image classification, object detection, and semantic segmentation, validating both the effectiveness and the generality of task-driven masking for self-supervised representation learning.
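To make "backpropagating downstream-task gradients into mask selection" concrete, here is a minimal, illustrative sketch (not the paper's exact architecture): a small scoring network assigns each patch a masking logit, and Gumbel noise plus a straight-through top-k estimator keeps the discrete mask selection differentiable, so any downstream loss applied to the masked output can update the scorer. All names and the specific estimator are assumptions for illustration.

```python
import torch

class DifferentiableMasker(torch.nn.Module):
    """Illustrative learnable masker: per-patch scores -> differentiable top-k mask."""

    def __init__(self, embed_dim: int, mask_ratio: float = 0.75):
        super().__init__()
        self.scorer = torch.nn.Linear(embed_dim, 1)  # per-patch mask logit
        self.mask_ratio = mask_ratio

    def forward(self, patches: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # patches: (batch, num_patches, embed_dim)
        logits = self.scorer(patches).squeeze(-1)              # (B, N)
        # Gumbel noise makes selection stochastic yet reparameterizable.
        gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
        soft = torch.softmax((logits + gumbel) / tau, dim=-1)  # soft scores
        num_mask = int(self.mask_ratio * patches.shape[1])
        # Straight-through estimator: forward uses the discrete top-k mask,
        # backward uses the soft probabilities, so gradients reach the scorer.
        idx = soft.topk(num_mask, dim=-1).indices
        hard = torch.zeros_like(soft).scatter(-1, idx, 1.0)
        return hard + soft - soft.detach()                     # (B, N), 1 = masked

masker = DifferentiableMasker(embed_dim=64)
x = torch.randn(2, 16, 64)
mask = masker(x)
# Any downstream loss on the masked result can now update the scorer:
mask.sum().backward()
```

In a full pipeline, the downstream task's loss (classification, detection, etc.) would replace the placeholder `mask.sum()` objective, driving the masking policy toward task-relevant patches.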
📝 Abstract
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency.
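For reference, the uniform random masking the abstract critiques can be sketched in a few lines: every patch is equally likely to be dropped, regardless of how informative it is. This is a simplified rendering of the standard MAE masking step, with names chosen for illustration.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style uniform random masking: shuffle patches, keep a fixed fraction."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                # one uniform score per patch
    ids_shuffle = noise.argsort(dim=1)      # random permutation per image
    ids_keep = ids_shuffle[:, :num_keep]    # indices of visible patches
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                 # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

x = torch.randn(2, 16, 8)                   # 2 images, 16 patches, dim 8
visible, mask = random_masking(x)           # 4 visible patches per image
```

The encoder then sees only `visible`, and the decoder reconstructs the patches where `mask == 1`; because `noise` ignores patch content, informative and uninformative patches are masked with equal probability, which is precisely the limitation MLO-MAE targets.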