Self-supervised pretraining for an iterative image size agnostic vision transformer

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing Vision Transformers during self-supervised pretraining: computational inefficiency and poor adaptability to arbitrary image resolutions. The authors propose a sequential-to-global self-supervised pretraining framework that, for the first time, integrates self-supervised learning into an iterative multi-scale Vision Transformer architecture, enabling resolution-agnostic pretraining. Building on the DINO self-distillation objective, the method uses integral-image-accelerated multi-scale patch extraction and a recursive, time-unrolled backpropagation mechanism to support inputs of any resolution at a constant computational cost. Experiments show competitive performance on ImageNet-1K and several downstream classification tasks, with inference cost that does not change with input resolution.
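To make the self-distillation objective mentioned above concrete, here is a minimal NumPy sketch of a DINO-style loss: the cross-entropy between a centered, sharpened teacher distribution and the student distribution. It is an illustration under stated assumptions, not the authors' implementation. The function names, the temperature values, and the center momentum are illustrative choices, and the summary does not say how the loss is applied across the iterative steps of the model.

```python
import numpy as np

def softmax(x, temp):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x, temp):
    """Temperature-scaled log-softmax over the last axis (numerically stable)."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student), averaged over the batch.

    The teacher output is centered and sharpened with a low temperature;
    in training it would be treated as a constant (no gradient).
    Temperatures here are illustrative assumptions.
    """
    t = softmax(teacher_logits - center, teacher_temp)
    log_s = log_softmax(student_logits, student_temp)
    return float(-(t * log_s).sum(axis=-1).mean())

def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the teacher's batch-mean output."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)
```

The loss is small when the student agrees with the teacher's peaked distribution and large when it puts its mass elsewhere; the running center is what discourages the teacher from collapsing onto a single output dimension.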

📝 Abstract

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
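The integral-image idea referenced in the abstract can be sketched as follows. A summed-area table lets the mean of any axis-aligned box be read with four lookups, so a patch at any zoom level can be downsampled to a fixed grid at a cost that depends only on the output size, not on the region size. This is a generic sketch of the technique, not the paper's extraction method: the function names, the box-averaging scheme, and the requirement that the region size be divisible by the output size are assumptions made for illustration.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    h, w = img.shape[:2]
    sat = np.zeros((h + 1, w + 1) + img.shape[2:], dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_mean(sat, top, left, height, width):
    """Mean of img[top:top+height, left:left+width] using four lookups."""
    total = (sat[top + height, left + width]
             - sat[top, left + width]
             - sat[top + height, left]
             + sat[top, left])
    return total / (height * width)

def extract_patch(sat, top, left, size, out=4):
    """Downsample a size x size region to out x out by box averaging.

    Assumes size is divisible by out (hypothetical simplification).
    """
    cell = size // out
    patch = np.empty((out, out) + sat.shape[2:])
    for i in range(out):
        for j in range(out):
            patch[i, j] = box_mean(sat, top + i * cell, left + j * cell, cell, cell)
    return patch
```

For example, after computing `sat = integral_image(img)` once, calling `extract_patch(sat, 0, 0, 64, out=4)` and `extract_patch(sat, 0, 0, 512, out=4)` both perform the same number of lookups, which is the property that keeps per-patch cost independent of zoom level.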

Problem

Research questions and friction points this paper is trying to address.

Vision Transformers

self-supervised learning

image size agnostic

computational efficiency

resolution scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning

vision transformer

resolution agnostic