VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

📅 2026-04-27
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge that conventional autoregressive image generation models suffer from computational costs that scale prohibitively with image resolution, hindering support for arbitrary resolutions and aspect ratios. The authors propose VibeToken, a resolution-agnostic 1D Transformer-based image tokenizer that dynamically encodes images into 32–256 tokens, along with a lightweight autoregressive generator, VibeToken-Gen, which decouples token count from image resolution for the first time. Using only 64 tokens, the method generates 1024×1024 images with a gFID of 3.94, maintaining a constant inference cost of 179G FLOPs (63.4× more efficient than comparable autoregressive models) while substantially narrowing the generation quality gap with diffusion models.
📝 Abstract
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
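The scaling argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical 2D grid tokenizer that emits one token per 16×16 pixel block (the patch size of 16 is an assumption), and it counts only pairwise self-attention comparisons rather than full model FLOPs. The 64-token budget and the two FLOP figures are the rounded values quoted in the abstract.

```python
# Illustrative arithmetic only. The 16-pixel patch size is an assumption,
# not a figure stated in the abstract.

def grid_tokens(height: int, width: int, patch: int = 16) -> int:
    """Token count for a hypothetical 2D tokenizer: one token per patch x patch block."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Token pairs compared by full self-attention (quadratic in sequence length)."""
    return num_tokens * num_tokens


FIXED_TOKENS = 64  # token budget quoted in the abstract for 1024x1024 synthesis

for side in (256, 512, 1024):
    n = grid_tokens(side, side)
    # How many times more attention pairs the grid tokenizer needs
    # than a fixed 64-token sequence at the same resolution.
    relative_cost = attention_pairs(n) // attention_pairs(FIXED_TOKENS)
    print(f"{side}x{side}: {n} grid tokens, {relative_cost}x the attention pairs of 64 tokens")

# Ratio of the rounded FLOP figures quoted in the abstract.
LLAMAGEN_FLOPS = 11e12        # 11T FLOPs at 1024x1024
VIBETOKEN_GEN_FLOPS = 179e9   # 179G FLOPs, independent of resolution
flops_ratio = LLAMAGEN_FLOPS / VIBETOKEN_GEN_FLOPS
print(f"FLOPs ratio from rounded figures: {flops_ratio:.1f}x")
```

Under these assumptions the grid tokenizer's sequence length grows with the pixel count, while the fixed budget stays at 64 tokens at every resolution. The rounded FLOP figures give a ratio of roughly 61x; the reported 63.4x presumably comes from unrounded FLOP counts, which the abstract does not give.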
Problem

Research questions and friction points this paper is trying to address.

autoregressive image generation
arbitrary resolution
image tokenization
computational efficiency
dynamic resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

resolution-agnostic
autoregressive image generation
1D image tokenizer
dynamic token sequence
computational efficiency