Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether vision transformers (ViTs) can be pre-trained on purely algorithmic, non-visual, semantically void sequence data to improve data efficiency and convergence speed on downstream image tasks. We propose a patch-embedding-free pre-training paradigm that uses formal grammars to generate synthetic, non-image, non-semantic algorithmic sequences as input, enabling the model to acquire generic computational priors before any image-based training. This constitutes the first explicit injection of cross-modal inductive bias into ViTs without relying on real or synthetic visual data. On ImageNet-1k, devoting only 1% of the standard training budget to procedural pre-training yields a fine-tuning accuracy gain exceeding 1.7%, equivalent to augmenting the training set with an additional 28% of real images. The approach consistently improves performance across diverse downstream vision tasks.

📝 Abstract
Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
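The abstract mentions generating pretraining data with "simple algorithms such as formal grammars" but does not specify which grammars. As a minimal sketch under that assumption, the toy context-free grammar below (the symbols, vocabulary, and padding scheme are all illustrative, not the paper's) shows how semantically void token sequences could be procedurally sampled:

```python
import random

# Toy context-free grammar: uppercase symbols are nonterminals,
# lowercase symbols map to integer token ids.
GRAMMAR = {
    "S": [["A", "B"], ["B", "A", "S"]],
    "A": [["a"], ["a", "A"]],
    "B": [["b"], ["b", "B"]],
}
TERMINALS = {"a": 0, "b": 1}
PAD_ID = 2

def expand(symbol, rng, depth=0, max_depth=8):
    """Recursively expand a grammar symbol into a list of terminal token ids."""
    if symbol in TERMINALS:
        return [TERMINALS[symbol]]
    rules = GRAMMAR[symbol]
    # Force termination near the depth limit by taking the shortest rule.
    rule = min(rules, key=len) if depth >= max_depth else rng.choice(rules)
    out = []
    for s in rule:
        out.extend(expand(s, rng, depth + 1, max_depth))
    return out

def sample_sequence(seed=0, seq_len=16):
    """Sample one fixed-length, padded token sequence for warm-up pretraining."""
    rng = random.Random(seed)
    toks = expand("S", rng)
    return (toks + [PAD_ID] * seq_len)[:seq_len]

seq = sample_sequence(seed=42)
```

Sequences drawn this way carry syntactic structure (nesting, repetition) but no visual or semantic content, which is the property the warm-up phase relies on.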
Problem

Research questions and friction points this paper is trying to address.

Instilling generic inductive biases in vision transformers through procedural pretraining
Bypassing visual patch embedding to internalize abstract computational priors
Improving data efficiency and convergence speed in image-based training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretraining ViTs with procedurally-generated abstract data
Bypassing visual embeddings to internalize computational priors
Improving data efficiency and accuracy with minimal procedural warm-up
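The bullets above describe feeding procedural sequences to the ViT while bypassing its patch embedding. A hedged NumPy sketch of that two-phase setup follows; the single tanh layer stands in for the shared transformer encoder, and all names, shapes, and dimensions are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

D_MODEL = 8
rng = np.random.default_rng(0)

# Shared "encoder" weights, reused unchanged across both phases.
W_enc = rng.standard_normal((D_MODEL, D_MODEL))

def encoder(x):
    """Stand-in for the shared transformer encoder: (seq_len, d) -> (seq_len, d)."""
    return np.tanh(x @ W_enc)

# Phase 1 (warm-up): a token-embedding lookup replaces the patch embedding,
# so procedural token sequences reach the encoder directly.
token_table = rng.standard_normal((3, D_MODEL))   # vocab of 3 procedural tokens
tokens = np.array([0, 1, 2, 1])
h_warmup = encoder(token_table[tokens])           # shape (4, D_MODEL)

# Phase 2 (fine-tuning): the patch embedding projects flattened image
# patches into the same d_model space, so the warmed-up encoder is
# reused as-is on visual input.
patch_dim = 4 * 4 * 3                             # 4x4 RGB patches, flattened
W_patch = rng.standard_normal((patch_dim, D_MODEL))
patches = rng.standard_normal((4, patch_dim))     # 4 patches from one image
h_image = encoder(patches @ W_patch)              # shape (4, D_MODEL)
```

Because both input paths produce `(seq_len, d_model)` activations, the encoder weights learned in the warm-up phase transfer directly to image-based training, which is what lets the abstract computational priors carry over.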