In Pursuit of Pixel Supervision for Visual Pre-training

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates pixel-level self-supervised visual pre-training, aiming to balance representation generality, methodological simplicity, and training stability. We propose Pixio, an enhanced masked autoencoder that reconstructs images end-to-end in pixel space, uses a high-capacity Vision Transformer (ViT) architecture, and applies a self-filtering data curation strategy, enabling efficient large-scale training on 2 billion web images. To our knowledge, this is the first systematic demonstration that purely pixel-space self-supervision remains competitive at this scale, without latent-space projection or complex momentum-based mechanisms. Pixio serves as both an effective alternative and a complement to latent-space methods such as DINOv3. On downstream tasks, including monocular depth estimation (Depth Anything), feed-forward 3D reconstruction (MapAnything), semantic segmentation, and robot learning, Pixio matches or outperforms DINOv3 models trained at similar scales.

📝 Abstract
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
Problem

Research questions and friction points this paper is trying to address.

Can autoencoder-based self-supervised learning from raw pixels remain competitive with modern latent-space methods?
How should the masked autoencoder's pre-training tasks and architecture be strengthened to improve representation quality?
Can training on large-scale web images yield representations that transfer across diverse downstream tasks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced masked autoencoder with more challenging pre-training tasks and more capable architectures
Self-curation strategy for training on 2B web-crawled images with minimal human curation
Pixel-space self-supervised learning as a competitive alternative and complement to latent-space approaches
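To make the core idea concrete, below is a minimal sketch of the MAE-style pixel-reconstruction objective that Pixio builds on: split an image into patches, mask a random subset, and compute a reconstruction loss only on the masked patches. The trivial mean-fill "decoder" and all function names here are illustrative stand-ins, not Pixio's actual ViT architecture or training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_pixel_loss(image, patch=4, mask_ratio=0.75, rng=rng):
    """Split an image into non-overlapping patches, mask a random subset,
    and score a reconstruction with MSE over the masked patches only,
    as in MAE-style pixel-space pre-training."""
    h, w, c = image.shape
    # Flatten the image into (num_patches, patch*patch*c) pixel vectors.
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    n = patches.shape[0]
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    # Stand-in "decoder": predict the mean of the visible patches.
    # (A real MAE decodes from ViT encodings of the visible patches.)
    visible = np.delete(patches, masked_idx, axis=0)
    prediction = visible.mean(axis=0, keepdims=True)
    # The loss is averaged over masked patches only.
    return float(((patches[masked_idx] - prediction) ** 2).mean())

img = rng.random((32, 32, 3))  # a random 32x32 RGB "image" in [0, 1)
loss = mae_pixel_loss(img)
```

The high mask ratio (75% here, following the original MAE) is what makes the pre-training task non-trivial: the model must infer most of the image from a small visible subset.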
Lihe Yang
The University of Hong Kong
Computer Vision · Deep Learning
Shang-Wen Li
FAIR, Meta
Yang Li
FAIR, Meta
Xinjie Lei
FAIR, Meta
Dong Wang
FAIR, Meta
Abdelrahman Mohamed
FAIR, Meta
Hengshuang Zhao
The University of Hong Kong
Computer Vision · Machine Learning · Artificial Intelligence
Hu Xu
FAIR, Meta