Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates scaling laws for autoregressive pixel prediction within unified vision models. Using IsoFlops-configured Transformer architectures, we systematically analyze scaling behavior across classification, generative modeling, and pixel prediction tasks on 32×32 images. Our findings are threefold: (1) Classification and generation follow fundamentally distinct optimal scaling trajectories—generation requires 3–5× higher data scaling rates than classification; (2) As input resolution increases, model parameter count must grow significantly faster than dataset size, rendering computation—not data—the dominant bottleneck; (3) Projected hardware advances suggest that practical, pixel-level image modeling will become feasible within the next five years. These results provide quantifiable, task-aware scaling principles to guide resource allocation and architectural design for unified visual representation learning.

📝 Abstract
This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end, yet under-explored framework for unified vision models. Starting with images at 32×32 resolution, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed 32×32 resolution, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires dataset size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that model size must grow much faster than data size. Surprisingly, projecting our findings forward, we find that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast that pixel-by-pixel modeling of images will become feasible within the next five years.
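The IsoFlops analysis above trades off model size N against data size D under a fixed compute budget C. A minimal sketch of that allocation, assuming the common approximation C ≈ 6·N·D and hypothetical power-law exponents (the specific exponent values below are illustrative, not the paper's fitted ones):

```python
def optimal_allocation(compute_flops, a=0.5, b=0.5, k=6.0):
    """Split a compute budget C into model size N ~ C^a and data size D ~ C^b,
    under the approximation C = k * N * D. With a + b = 1, N = (C/k)^a and
    D = (C/k)^b, so the budget is exactly consumed."""
    base = compute_flops / k
    n_params = base ** a   # optimal parameter count
    n_tokens = base ** b   # optimal number of training pixels/tokens
    return n_params, n_tokens

# Mimicking the paper's finding that a generation-optimal run scales data
# 3-5x faster than a classification-optimal run, by shifting the exponent
# toward data (exponent values are made up for illustration):
cls_n, cls_d = optimal_allocation(7e19, a=0.55, b=0.45)  # classification-like
gen_n, gen_d = optimal_allocation(7e19, a=0.40, b=0.60)  # generation-like
```

At the same 7e19-FLOP budget, the generation-like setup ends up with more data and a smaller model than the classification-like one, which is the qualitative divergence the abstract describes.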
Problem

Research questions and friction points this paper is trying to address.

Investigates scaling properties of autoregressive next-pixel prediction for vision models
Shows that optimal scaling strategies differ between classification and generation tasks
Identifies compute as primary bottleneck rather than training data availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive next-pixel prediction for unified vision models
Optimal scaling strategy varies by task requirements
Compute identified as primary bottleneck over data
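The five-year feasibility forecast follows from compounding annual compute growth. A back-of-envelope sketch, assuming a hypothetical target budget (the 1e23-FLOP figure below is an illustrative placeholder, not a number from the paper):

```python
import math

def years_to_reach(c0, c_target, annual_growth):
    """Years until a budget c0 reaches c_target, growing annual_growth-x per
    year: solve c0 * g**t = c_target for t."""
    return math.log(c_target / c0) / math.log(annual_growth)

# From the paper's largest 7e19-FLOP budget to a hypothetical 1e23-FLOP
# requirement, at the 4x-5x annual growth rates cited in the abstract:
years_at_4x = years_to_reach(7e19, 1e23, 4.0)
years_at_5x = years_to_reach(7e19, 1e23, 5.0)
```

At these illustrative numbers, the gap closes in roughly four to five years, consistent with the forecast horizon stated above.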