Suppressing Non-Semantic Noise in Masked Image Modeling Representations

📅 2026-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work shows that representations learned through Masked Image Modeling (MIM) retain non-semantic information, which degrades downstream performance at inference time. To mitigate this without retraining, the authors propose Semantically Orthogonal Artifact Projection (SOAP), a model-agnostic, post-hoc method. They first define a semantic invariance score using Principal Component Analysis (PCA) on real and synthetic non-semantic images, which quantifies how much non-semantic information a model's features carry. Based on that score, SOAP suppresses the non-semantic components of patch-level representations with a single linear head. Across various MIM-based models, SOAP consistently improves zero-shot performance.

📝 Abstract

Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
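
The abstract does not spell out how the score is computed or how the projection is built, so the following is only a minimal sketch of one plausible reading: fit PCA on features extracted from non-semantic images, treat the top `k` principal directions as the non-semantic subspace, and remove that subspace from patch features with a fixed linear map. The function names, the choice of `k`, and the random stand-in data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fit_nonsemantic_basis(nonsemantic_feats, k):
    """Top-k principal directions of features from non-semantic images.

    nonsemantic_feats: (n, d) array of features (hypothetical input).
    Returns a (d, k) matrix with orthonormal columns.
    """
    centered = nonsemantic_feats - nonsemantic_feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def projection_head(basis):
    """Linear map P = I - V V^T that removes the span of `basis`."""
    d = basis.shape[0]
    return np.eye(d) - basis @ basis.T

# Synthetic stand-in data (assumption): non-semantic features that lie
# mostly in a random k-dimensional subspace of a d-dimensional space.
rng = np.random.default_rng(0)
d, k = 16, 3
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]
nonsemantic_feats = (
    5.0 * rng.normal(size=(500, k)) @ true_dirs.T
    + 0.01 * rng.normal(size=(500, d))
)

V = fit_nonsemantic_basis(nonsemantic_feats, k)
P = projection_head(V)

# Apply the head to patch features from one image (196 patches here).
patches = rng.normal(size=(196, d))
cleaned = patches @ P  # P is symmetric, so no transpose is needed
```

Because `P` is a fixed `d × d` matrix, it can be applied after any frozen encoder without training, which matches the abstract's description of a single linear head; how many components to remove would in practice depend on the paper's invariance score rather than a hand-picked `k`.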

Problem

Research questions and friction points this paper is trying to address.

Masked Image Modeling

non-semantic noise

representation learning

semantic invariance

self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Image Modeling

Semantic Invariance

Non-Semantic Noise