Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

📅 2025-06-26
🤖 AI Summary
To address token redundancy in discrete image representations and the inefficiency of high-resolution modeling, this paper introduces a generative paradigm built on a 1D binary latent space, replacing conventional one-hot codebook tokens with compact sequences of binary vectors to drastically reduce sequence length. Methodologically, the authors design a lightweight tokenizer and a unified discrete generative framework compatible with both diffusion and autoregressive modeling. The key contribution is the first demonstration of competitive 1024×1024 text-to-image generation using only 128 discrete tokens, a 32× reduction in token count compared to standard VQ-VAEs. The model trains efficiently: a global batch size of 4096 on a single 8-GPU node, completed within 200 GPU-days, without private training data or post-training refinements, while matching the performance of modern image generation models. This paradigm significantly improves efficiency and scalability for multimodal understanding and generation.

๐Ÿ“ Abstract
Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
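As a rough illustration of the binary-latent idea in the abstract, the sketch below binarizes a short 1D latent sequence element-wise, so each "token" is a binary vector rather than a one-hot codebook index. The shapes, threshold, and function name are illustrative assumptions, not the paper's actual tokenizer.

```python
import random

def binarize(latents):
    """Threshold continuous latents element-wise into {0, 1} binary codes.
    (In training, a straight-through estimator would typically be used to
    pass gradients through this non-differentiable step.)"""
    return [[1 if x > 0.0 else 0 for x in vec] for vec in latents]

random.seed(0)
# 128 latent vectors per image, each binarized into a 64-bit code
# (the 64-bit width is an assumed dimension for illustration).
latents = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(128)]
codes = binarize(latents)

print(len(codes), len(codes[0]))  # 128 64
```

With 128 such binary vectors per image, the sequence length stays fixed regardless of resolution, which is what enables the short sequences reported for 1024x1024 generation.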
Problem

Research questions and friction points this paper addresses.

- Advancing a 1D binary latent space for compact image representation
- Reducing token counts for high-resolution image generation
- Improving training and inference speed with simple architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

- 1D binary image latents for compact representation
- 128 discrete tokens for 1024x1024 images
- Global batch size of 4096 on a single GPU node
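The 32-fold reduction cited above follows from simple arithmetic, assuming a standard VQ-VAE tokenizer with a 16x16-pixel patch grid at 1024x1024 resolution (the patch size is an assumption for illustration):

```python
# Assumed patch size of 16 for a conventional 2D VQ-VAE tokenizer.
vqvae_tokens = (1024 // 16) * (1024 // 16)  # 64 * 64 = 4096 grid tokens
binary_tokens = 128                         # 1D binary latent sequence length
reduction = vqvae_tokens // binary_tokens
print(vqvae_tokens, reduction)  # 4096 32
```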