CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow pretraining convergence of MAE-style methods caused by random masking, and the inefficient parameter utilization of fixed-resolution Vision Transformers (ViTs), this paper proposes the CoMA framework. First, it introduces a complementary masking strategy that guarantees pixel-level uniform sampling, improving the robustness of feature learning. Second, it designs a hierarchical Dynamic ViT (DyViT) built on Dynamic Multi-Window Self-Attention (DM-MSA), which reduces both parameter count and computation while improving fine-grained representation. On ImageNet-1K, CoMA matches MAE's linear-evaluation accuracy using only 12% of MAE's pretraining epochs, with a 10% reduction in per-epoch training time. Together, these changes substantially improve the efficiency of self-supervised representation learning and adaptability to downstream tasks.
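The complementary masking idea can be sketched as a pair of masks where the second view hides exactly the patches the first view kept, so every patch is reconstructed exactly once across the two views. The sketch below is a minimal NumPy illustration, not the paper's implementation; the function name and the 50% ratio (chosen so both views have equal mask ratios) are our assumptions:

```python
import numpy as np

def complementary_masks(num_patches, mask_ratio=0.5, seed=0):
    """Sample a random patch mask and its complement so that, across the
    two masked views, every patch is hidden (and reconstructed) exactly once."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_patches)
    k = int(num_patches * mask_ratio)
    mask_a = np.zeros(num_patches, dtype=bool)
    mask_a[perm[:k]] = True   # patches hidden in view A
    mask_b = ~mask_a          # view B hides exactly what view A kept
    return mask_a, mask_b

# 196 patches = a 14x14 grid, as for a 224x224 image with 16x16 patches
a, b = complementary_masks(196, mask_ratio=0.5)
assert not np.any(a & b)      # no patch is hidden in both views
assert np.all(a | b)          # every patch is hidden in exactly one view
```

With a ratio other than 0.5 the complement view would have ratio 1 − r, so the two views become asymmetric; the key property is only that their union covers all patches uniformly.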

📝 Abstract
Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, the ViT in MAE suffers from inefficient parameter use due to its fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby promoting effective learning of all features and enhancing the model's adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing the parameters and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also attains a 10% reduction in pre-training time per epoch, further underscoring its superior pre-training efficiency.
Problem

Research questions and friction points this paper is trying to address.

Random masking in MAE-style methods causes slow pretraining convergence, demanding many epochs to reach strong downstream adaptability
Fixed spatial resolution across ViT layers leads to inefficient parameter utilization
Can the pre-training epoch budget be cut sharply (here, by 88%) without sacrificing downstream task performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complementary masking strategy guarantees uniform sampling of every pixel
Hierarchical vision transformer (DyViT) built on Dynamic Multi-Window Self-Attention (DM-MSA)
Fewer parameters and FLOPs alongside improved fine-grained feature learning
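As a rough illustration of why window-restricted attention cuts computation (cost scales with tokens × window size rather than tokens squared), the 1-D NumPy toy below computes attention inside non-overlapping windows and naively averages branches of several window sizes. This is our sketch of the general windowed-attention idea, not the paper's DM-MSA: the dynamic window selection and multi-head structure are omitted, and the mean fusion is an assumption:

```python
import numpy as np

def window_attention(x, window):
    """Self-attention restricted to non-overlapping windows of tokens.
    x: (num_tokens, dim); num_tokens must be divisible by window."""
    n, d = x.shape
    xw = x.reshape(n // window, window, d)            # (num_windows, window, dim)
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(d)  # per-window attention logits
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over window tokens
    return (attn @ xw).reshape(n, d)

# A multi-window layer could combine branches with different window sizes:
x = np.random.default_rng(0).normal(size=(16, 8))
branches = [window_attention(x, w) for w in (2, 4, 8)]
mixed = np.mean(branches, axis=0)                     # naive fusion of the branches
```

Smaller windows capture fine-grained local structure cheaply, while larger windows widen the receptive field; mixing several sizes per stage is the intuition behind a multi-window design.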