ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Vision Transformer (ViT) image tokenizers generalize poorly beyond their training resolution and rely on adversarial losses that hinder stable scaling. This work introduces ViTok-v2, which uses NaFlex to natively support multiple resolutions and arbitrary aspect ratios, and replaces both LPIPS and GAN objectives with a DINOv3-based perceptual loss for stable large-scale training. Trained on about 2B images and scaled to 5B parameters, ViTok-v2 is the largest image autoencoder to date; it matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. Joint scaling of the autoencoder and a flow-matching generator advances the Pareto frontier of the reconstruction-generation trade-off.

📝 Abstract

Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside training resolutions, and reliance on adversarial losses prevents stable scaling. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B images and scaled to 5B parameters, the largest image autoencoder to date. ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. In joint scaling experiments with flow matching generators, we show that scaling both the autoencoder and the generator advances the Pareto frontier of this trade-off.
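
To make the compression ratio r in the abstract concrete, here is a minimal sketch of how such a ratio could be computed for a patch-based tokenizer running at native resolution. It rests on assumptions that do not come from the abstract: r is taken to be input pixel values divided by latent values, and the patch size (16) and latent channel width (64) are hypothetical placeholders rather than ViTok-v2's actual configuration. The helper names `token_grid` and `compression_ratio` are illustrative only.

```python
import math


def token_grid(height, width, patch=16):
    """Patch rows and columns for an image kept at its native size.

    Assumes the image is padded up to the next multiple of `patch`
    (a hypothetical choice, not taken from the paper).
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def compression_ratio(height, width, patch=16, latent_channels=64,
                      image_channels=3):
    """Assumed definition of r: input pixel values / latent values."""
    rows, cols = token_grid(height, width, patch)
    latent_values = rows * cols * latent_channels
    return (height * width * image_channels) / latent_values


# Square and non-square inputs share the same ratio when both sides are
# multiples of the patch size, because the token count grows with the image.
print(compression_ratio(256, 256))                      # 12.0
print(compression_ratio(512, 384))                      # 12.0
# Fewer latent channels means a higher r, i.e. stronger compression.
print(compression_ratio(256, 256, latent_channels=16))  # 48.0
```

Under this assumed definition, a lower r leaves more latent values per pixel, which matches the abstract's statement that lower r gives better reconstructions while making generation harder.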

Problem

Research questions and friction points this paper is trying to address.

Vision Transformer

autoencoder

resolution generalization

adversarial loss

scaling

Innovation

Methods, ideas, or system contributions that make the work stand out.

native resolution

DINOv3 perceptual loss

NaFlex