BitDance: Scaling Autoregressive Generative Models with Binary Tokens

📅 2026-02-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes BitDance, a novel approach that overcomes the limitations of traditional autoregressive image generation models, which suffer from restricted expressivity and slow inference due to discrete codebook indices. BitDance uniquely integrates high-entropy binary visual tokens with a diffusion mechanism, enabling efficient sampling in continuous space via a binary diffusion head. It further introduces next-patch diffusion to achieve parallel autoregressive decoding. The method substantially enhances both model expressivity and generation speed: on ImageNet 256×256, it achieves a state-of-the-art FID of 1.24 among all autoregressive models. With only 260M parameters, BitDance accelerates inference by 8.7× compared to a 1.4B-parameter parallel autoregressive model and achieves over 30× speedup for 1024×1024 image generation.

📝 Abstract
We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
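The binary diffusion head described above replaces softmax index prediction with diffusion in continuous space followed by quantization to binary bits. The paper does not publish pseudocode here, so the following is a minimal, hypothetical sketch of that idea: run a toy denoising loop over a 256-dimensional continuous latent, then take the sign of each coordinate to obtain a binary token with up to $2^{256}$ states. The `denoise_fn`, the linear noise schedule, and the `target` pattern are all illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def binary_diffusion_head(denoise_fn, dim=256, steps=10, rng=None):
    """Hedged sketch: sample a binary token by diffusing in continuous
    space, then thresholding to {-1, +1}. In BitDance the denoiser would
    be a small learned head conditioned on the AR backbone's output;
    here denoise_fn(x, t) is a stand-in that predicts the clean latent."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(dim)           # start from pure Gaussian noise
    for t in range(steps, 0, -1):
        x0_hat = denoise_fn(x, t)          # predicted clean latent
        alpha = (t - 1) / steps            # toy linear noise schedule
        x = x0_hat + alpha * rng.standard_normal(dim)  # re-noise toward t-1
    return np.sign(x)                      # quantize: one of 2^dim states

# Toy denoiser that pulls latents toward a fixed bit pattern (illustrative only).
target = np.where(np.arange(256) % 2 == 0, 1.0, -1.0)
token = binary_diffusion_head(lambda x, t: 0.5 * x + 0.5 * target, rng=0)
```

The key point of the design is that sampling happens in continuous space, so no 2^256-way classification layer is ever materialized; the binary token only appears after the final thresholding step.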
Problem

Research questions and friction points this paper is trying to address.

autoregressive models
image generation
binary tokens
scalability
inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

binary tokens
autoregressive generation
diffusion head
next-patch diffusion
scalable image synthesis
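Next-patch diffusion, listed above as a contribution, decodes a whole patch of tokens per autoregressive step instead of a single token, cutting the number of sequential steps. The paper's exact procedure is not reproduced here; the sketch below only illustrates the step-count arithmetic, with `predict_patch` as a hypothetical stand-in for the model's patch head.

```python
import numpy as np

def decode_next_patch(predict_patch, num_tokens, patch_size):
    """Hedged sketch of next-patch decoding: each AR step emits
    patch_size tokens at once, so sequential steps drop from
    num_tokens to num_tokens / patch_size."""
    tokens, steps = [], 0
    while len(tokens) < num_tokens:
        patch = predict_patch(tokens)      # patch_size tokens in parallel
        tokens.extend(patch[:patch_size])
        steps += 1
    return tokens[:num_tokens], steps

# Dummy predictor returning constant binary tokens (illustration only).
pred = lambda ctx: [np.ones(256)] * 16
tokens, steps = decode_next_patch(pred, num_tokens=256, patch_size=16)
# 256 tokens emitted in 16 sequential steps rather than 256
```

This step-count reduction is what underlies the reported 8.7× inference speedup over token-by-token parallel AR baselines.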