An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

📅 2024-06-13
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work challenges the prevailing consensus that a locality inductive bias is indispensable in vision Transformers. It investigates whether pixel-level tokenization, which bypasses conventional patch-based partitioning (e.g., 16×16) and convolutional priors, is both feasible and effective. Method: The authors serialize raw images directly into per-pixel tokens and apply a standard Transformer architecture, studied under supervised learning, masked autoencoding, and diffusion-based generation. They evaluate performance across image classification, dense prediction, self-supervised reconstruction, and generative modeling. Contribution/Results: Experiments show that pure pixel-level Transformers achieve competitive performance relative to patch-based ViTs on multiple benchmarks, including ImageNet-1K, COCO, and ADE20K, without any explicit locality bias. This provides empirical evidence that the locality inductive bias is not strictly necessary for learning strong visual representations. The findings suggest an alternative architectural direction and give empirical grounding for next-generation vision models built on sequence-based design principles.

📝 Abstract
This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We showcase the effectiveness of pixels-as-tokens across three well-studied computer vision tasks: supervised learning for classification and regression, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although it's computationally less practical to directly operate on individual pixels, we believe the community must be made aware of this surprising piece of knowledge when devising the next generation of neural network architectures for computer vision.
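The contrast between the standard ViT design and the pixels-as-tokens setup can be sketched in a few lines. This is a minimal, hypothetical illustration using numpy, not the authors' code; the image size (224×224) and patch size (16) are the conventional ViT defaults assumed here:

```python
import numpy as np

# Hypothetical 224x224 RGB image with values in [0, 1].
img = np.random.rand(224, 224, 3)
H, W, C = img.shape
P = 16  # standard ViT patch size

# Standard ViT tokenization: flatten each non-overlapping 16x16 patch
# into one token, giving (224/16)^2 = 196 tokens of dimension 16*16*3 = 768.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patch_tokens = patches.reshape(-1, P * P * C)

# Pixels-as-tokens: every individual pixel is its own token,
# giving 224*224 = 50176 tokens of dimension 3.
pixel_tokens = img.reshape(-1, C)

print(patch_tokens.shape)  # (196, 768)
print(pixel_tokens.shape)  # (50176, 3)
```

The 256× longer sequence is why the paper notes that operating on individual pixels is computationally less practical: self-attention cost grows quadratically in sequence length.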
Problem

Research questions and friction points this paper is trying to address.

Is the inductive bias of locality truly necessary in modern vision Transformer architectures?
Can a vanilla Transformer perform well when each individual pixel is treated as a token?
Does pixel-level tokenization hold up across classification, masked autoencoding, and image generation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats each individual pixel as a token in a vanilla Transformer, removing the patch-based locality prior of ViTs
Provides empirical evidence questioning the necessity of locality bias in vision architectures
Validates pixels-as-tokens across supervised classification/regression, masked autoencoding, and diffusion-based generation