PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Masked image modeling (MIM) methods split into two families with complementary strengths: Pixel MIM reconstructs raw pixels and tends to capture low-level details such as color and texture, while Latent MIM predicts latent representations and captures high-level object semantics, so either alone can underperform on tasks that depend on the other level of features. The paper proposes PiLaMIM, a unified framework with a single shared encoder and two decoders, one predicting pixel values and the other latent representations, so that a single model learns both low- and high-level visual features. The [CLS] token is additionally included in the reconstruction process to aggregate global context and inject more semantic information. Experiments show that PiLaMIM outperforms key baselines such as MAE, I-JEPA, and BootMAE in most cases, yielding richer and more transferable visual representations.

📝 Abstract
In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.
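The dual-decoder design described in the abstract can be sketched as a joint loss over masked patches: one head regresses raw pixel values, the other regresses latent targets from a separate target encoder. The sketch below is illustrative only; all names (`pilamim_loss`, `W_enc`, etc.) are hypothetical, tiny linear maps stand in for the paper's ViT encoder/decoders, and the equal loss weighting is an assumption, not the paper's stated recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the shared ViT encoder and the two decoders.
D_PATCH, D_LATENT = 16, 8                             # patch (pixel) dim, latent dim
W_enc = rng.normal(size=(D_PATCH, D_LATENT)) * 0.1    # shared encoder (hypothetical)
W_pix = rng.normal(size=(D_LATENT, D_PATCH)) * 0.1    # decoder 1: pixel head
W_lat = rng.normal(size=(D_LATENT, D_LATENT)) * 0.1   # decoder 2: latent head
W_tgt = rng.normal(size=(D_PATCH, D_LATENT)) * 0.1    # target encoder for latent targets

def pilamim_loss(patches, mask_ratio=0.75):
    """Joint pixel + latent reconstruction loss on randomly masked patches."""
    n = len(patches)
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)

    z = patches @ W_enc                   # shared encoding (simplified: all patches)
    pix_pred = z[masked] @ W_pix          # reconstruct raw pixels of masked patches
    lat_pred = z[masked] @ W_lat          # predict latent representations
    lat_tgt = patches[masked] @ W_tgt     # targets from the separate target encoder

    loss_pix = np.mean((pix_pred - patches[masked]) ** 2)
    loss_lat = np.mean((lat_pred - lat_tgt) ** 2)
    return loss_pix + loss_lat            # equal weighting is an assumption here

patches = rng.normal(size=(196, D_PATCH))  # e.g., 14x14 grid of patches for one image
loss = pilamim_loss(patches)
print(float(loss))
```

In the real method the encoder sees only visible patches and the [CLS] token also participates in reconstruction; the sketch collapses those details to show only the two-headed loss structure.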
Problem

Research questions and friction points this paper is trying to address.

Masked Image Modeling
Pixel MIM
Latent MIM
Innovation

Methods, ideas, or system contributions that make the work stand out.

PiLaMIM
Dual Decoders
CLS token