AI Summary
To address token redundancy in discrete image representation and inefficiency in high-resolution modeling, this paper introduces a generative paradigm based on a 1D binarized latent space, replacing conventional one-hot codebook tokens with compact binary vector sequences to drastically reduce sequence length. Methodologically, the authors design a lightweight image tokenizer and a unified discrete generative framework compatible with both diffusion and autoregressive modeling. The key contribution is the first demonstration of competitive 1024×1024 text-to-image generation using only 128 discrete tokens, a 32× reduction in token count compared to standard VQ-VAEs. The models train efficiently: a global batch size of 4096 on a single GPU node, completed within 200 GPU-days, without private data or post-training refinements, while matching the performance of modern image generation models. This paradigm significantly improves efficiency and scalability for multimodal understanding and generation.
Abstract
Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024×1024, demonstrating up to a 32-fold reduction in token count compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
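To make the token-count claim concrete, the sketch below works through the 32-fold reduction and shows a minimal binarization step of the kind the abstract describes (binary vectors instead of one-hot codebook indices). All names, the 16× downsampling factor, and the 64-bit code width are illustrative assumptions, not the paper's exact configuration; training would additionally need a straight-through or Bernoulli gradient estimator, which is omitted here.

```python
import numpy as np

def binarize(latents):
    """Deterministically quantize continuous latents to {0, 1} binary vectors.
    (A trained tokenizer would use a straight-through/stochastic estimator;
    this is only the inference-time thresholding.)"""
    return (latents > 0).astype(np.int8)

# Token-count arithmetic for a 1024x1024 image:
# a standard VQ-VAE with 16x spatial downsampling (assumed) yields a 64x64 grid.
vqvae_tokens = (1024 // 16) ** 2          # 4096 one-hot codebook tokens
binary_tokens = 128                       # 1D binary latent sequence (per the abstract)
reduction = vqvae_tokens // binary_tokens
print(reduction)                          # 32, matching the abstract's 32-fold claim

# Each of the 128 tokens is a binary vector rather than a codebook index;
# the 64-dimensional code width here is an assumed, illustrative value.
latents = np.random.randn(binary_tokens, 64)
codes = binarize(latents)
assert codes.shape == (128, 64)
assert set(np.unique(codes)) <= {0, 1}
```

The design point this illustrates: a binary vector of width $d$ spans $2^d$ effective codes per token, which is how a much shorter sequence can retain capacity comparable to a long sequence of one-hot tokens.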