FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional image tokenizers use a fixed number of tokens, which cannot adapt to an image's inherent complexity and limits generative flexibility and efficiency. This paper introduces FlexTok, a variable-length 1D tokenization framework that hierarchically compresses a 2D image into an ordered sequence of 1 to 256 discrete, semantically ordered tokens. To make both reconstruction and generation robust to the chosen token count, the authors co-train a rectified flow decoder with nested dropout applied to the token sequence. Generation uses a simple GPT-style autoregressive Transformer, optionally with text conditioning. On ImageNet, the method achieves FID < 2 with only 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods while using far fewer tokens. Crucially, this yields an adaptive coarse-to-fine "visual vocabulary": the number of tokens generated can scale with the complexity of the task without any architectural modification.

📝 Abstract
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
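The nested dropout idea mentioned in the abstract can be sketched in a few lines: during training, each 1D token sequence is truncated to a random prefix length, so earlier tokens are forced to carry the coarsest, most global information. The function below is a minimal illustrative sketch (the name `nested_dropout` and the `(batch, seq_len, dim)` layout are assumptions, not the paper's actual API):

```python
import numpy as np

def nested_dropout(tokens, k=None):
    """Zero out every token after a (possibly random) prefix of length k.

    tokens: array of shape (batch, seq_len, dim).
    If k is None, a prefix length is sampled uniformly from [1, seq_len],
    mimicking nested dropout over the ordered 1D token sequence.
    """
    b, n, d = tokens.shape
    if k is None:
        k = np.random.randint(1, n + 1)  # sample a prefix length
    out = tokens.copy()
    out[:, k:, :] = 0.0  # drop all tokens past the kept prefix
    return out
```

Because the decoder only ever sees such prefixes during training, it learns to produce a plausible reconstruction from any sequence length at inference time.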
Problem

Research questions and friction points this paper is trying to address.

Adaptive 1D tokenization for images with variable complexity.
Hierarchical compression of 2D images into flexible-length sequences.
Enabling high-quality generation with fewer tokens than existing methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resamples 2D images into variable-length 1D token sequences.
Employs a rectified flow decoder with nested dropout training.
Enables coarse-to-fine generation using a GPT-style Transformer architecture.