Masked Autoencoders are Robust Data Augmentors

📅 2022-06-10
🏛️ arXiv.org
📈 Citations: 27
Influential: 0
🤖 AI Summary
Deep neural networks are prone to overfitting in visual classification tasks, and conventional image augmentation techniques—largely relying on linear geometric or photometric transformations—struggle to generate semantically consistent yet discriminative hard examples. To address this, we propose Mask-Reconstruct Augmentation (MRA), the first method to integrate masked autoencoders (built upon Vision Transformer architectures) into supervised, semi-supervised, and few-shot classification pipelines. MRA employs stochastic block masking coupled with joint pixel-level reconstruction and classification training, yielding nonlinear, semantically coherent distorted views. Crucially, it enables model-driven hard-example generation, overcoming the limitations of hand-crafted augmentation heuristics. Extensive experiments across benchmarks—including ImageNet—demonstrate that MRA consistently improves classification accuracy and generalization performance across supervised, semi-supervised, and 5-shot settings, validating its robustness and broad applicability.
📝 Abstract
Deep neural networks are capable of learning powerful representations to tackle complex vision tasks but expose undesirable properties like the over-fitting issue. To this end, regularization techniques like image augmentation are necessary for deep neural networks to generalize well. Nevertheless, most prevalent image augmentation recipes confine themselves to off-the-shelf linear transformations like scale, flip, and color jitter. Due to their hand-crafted property, these augmentations are insufficient to generate truly hard augmented examples. In this paper, we propose a novel perspective of augmentation to regularize the training process. Inspired by the recent success of applying masked image modeling to self-supervised learning, we adopt the self-supervised masked autoencoder to generate the distorted view of the input images. We show that utilizing such model-based nonlinear transformation as data augmentation can improve high-level recognition tasks. We term the proposed method Mask-Reconstruct Augmentation (MRA). Extensive experiments on various image classification benchmarks verify the effectiveness of the proposed augmentation. Specifically, MRA consistently enhances performance on supervised, semi-supervised, as well as few-shot classification.
Problem

Research questions and friction points this paper is trying to address.

Overcoming over-fitting in deep neural networks
Generating hard augmented examples beyond linear transformations
Improving image classification via model-based nonlinear augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked autoencoder for data augmentation
Generates nonlinear distorted image views
Improves supervised and semi-supervised classification
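The mask-then-reconstruct idea above can be sketched in a few lines: randomly mask image blocks, then have a reconstruction model fill them in, producing a nonlinear distorted view. The sketch below is a minimal NumPy illustration, not the paper's implementation; the `block_mask` and `reconstruct` helpers are hypothetical, and mean-fill reconstruction is a stand-in for the pretrained masked autoencoder the paper actually uses.

```python
import numpy as np

def block_mask(image, patch=16, mask_ratio=0.4, rng=None):
    """Zero out a random subset of patch-aligned blocks in an (H, W, C) image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch                 # patch-grid dimensions
    n_mask = int(gh * gw * mask_ratio)              # number of blocks to hide
    mask = np.zeros(gh * gw, dtype=bool)
    mask[rng.choice(gh * gw, size=n_mask, replace=False)] = True
    mask = mask.reshape(gh, gw)
    # Expand the patch-level mask to pixel level and zero the masked pixels.
    pixel_mask = np.kron(mask, np.ones((patch, patch), dtype=bool))
    out = image.copy()
    out[pixel_mask] = 0.0
    return out, mask

def reconstruct(masked, mask, patch=16):
    """Placeholder for the MAE decoder: fill masked blocks with the
    per-channel mean of the visible pixels. A real MRA pipeline would
    run a pretrained masked autoencoder here instead."""
    pixel_mask = np.kron(mask, np.ones((patch, patch), dtype=bool))
    out = masked.copy()
    out[pixel_mask] = masked[~pixel_mask].mean(axis=0)
    return out

# Usage: the reconstructed image serves as the augmented training view.
img = np.full((64, 64, 3), 0.5)
masked_img, mask = block_mask(img, patch=16, mask_ratio=0.4)
augmented = reconstruct(masked_img, mask, patch=16)
```

Because the fill comes from a model (here, crudely approximated by a mean), the resulting view is a nonlinear, content-dependent distortion rather than a fixed geometric or photometric transform.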