Sample- and Parameter-Efficient Auto-Regressive Image Models

📅 2024-11-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
To address the low sample and parameter efficiency of autoregressive image models on large-scale imbalanced data, as well as their difficulty in capturing high-level structural semantics, this paper proposes XTRA: a block-wise autoregressive image modeling paradigm based on block causal masking, where *k*×*k* pixel blocks serve as the fundamental prediction unit—replacing conventional token-level modeling. XTRA integrates Vision Transformer architectures (ViT-B/16 and ViT-H/14) and enhances structural awareness via block-level causal constraints. Experiments demonstrate that XTRA-ViT-H/14 achieves state-of-the-art average Top-1 accuracy across 15 recognition benchmarks using only 13.1M samples—152× fewer than prior SOTA. XTRA-ViT-B/16 attains superior performance on linear and attention probing tasks with just 85M parameters—7–16× smaller than competing models. This work marks the first demonstration of simultaneous breakthroughs in both efficiency and representation quality for block-level autoregressive visual modeling.

📝 Abstract
We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k × k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152× fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7–16× fewer parameters (85M vs. 1.36B/0.63B).
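The Block Causal Mask described above can be sketched as follows: tokens are grouped into k × k blocks in raster order, and a token may attend to any token in its own block or an earlier block. This is a minimal NumPy illustration of the masking pattern, not the paper's implementation; the grid layout, block ordering, and function name are assumptions for the example.

```python
import numpy as np

def block_causal_mask(grid: int, k: int) -> np.ndarray:
    """Boolean attention mask for block-causal modeling (illustrative sketch).

    Tokens lie on a grid x grid patch grid; each k x k block of patches is
    one prediction unit. Token i may attend to token j iff j's block index
    (raster order over blocks) is <= i's block index, i.e. attention is
    full within a block and causal across blocks.
    """
    assert grid % k == 0, "grid must be divisible by block size k"
    n = grid * grid
    # row/column of each token on the patch grid
    rows, cols = np.divmod(np.arange(n), grid)
    # raster-order index of the k x k block each token belongs to
    block_id = (rows // k) * (grid // k) + (cols // k)
    # True where attention is permitted
    return block_id[None, :] <= block_id[:, None]

# 4x4 patch grid with 2x2 blocks: 16 tokens, 4 blocks
mask = block_causal_mask(grid=4, k=2)
```

With a standard causal mask the comparison would be on token indices (`j <= i`); replacing it with block indices is the one-line change that lets the model condition on whole regions at a time.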
Problem

Research questions and friction points this paper is trying to address.

Enhances sample and parameter efficiency in vision models
Improves auto-regressive image modeling with Block Causal Mask
Achieves higher accuracy with fewer samples and parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block Causal Mask for efficient token processing
Auto-regressive objective for scalable performance
Sample- and parameter-efficient vision model XTRA