🤖 AI Summary
Existing text-to-image models rely on massive private datasets, are hard to reproduce, and depend on image tokenizers that are costly to train. To address these issues, this paper introduces TA-TiTok, a lightweight, efficient one-dimensional text-aware image tokenizer that injects textual information directly during decoding and supports both discrete and continuous token representations. Building on TA-TiTok, the authors develop MaskGen, a family of text-to-image masked generative models trained exclusively on open-source data. The key contributions are: (1) a text-aware one-dimensional de-tokenization mechanism; (2) a simplified one-stage training recipe that eliminates the two-stage distillation required by prior 1D tokenizers; and (3) an open-data, open-weight text-to-image masked generative framework with publicly released models. Experiments demonstrate that TA-TiTok significantly improves training efficiency, while MaskGen achieves generation quality comparable to models trained on private data despite using only open data. All code, models, and tokenizers are publicly available.
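As an illustrative sketch only: the core idea of text-aware de-tokenization is that the decoder sees text embeddings alongside the 1D image tokens, so reconstruction can lean on semantics the compact latent omits. The paper's decoder is a transformer conditioned on pretrained text embeddings; the toy function below (all sizes and the mean-pooling "decoder" are assumptions, not the actual architecture) shows only the conditioning pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 32 1-D image tokens,
# 77 text tokens, shared width 64, decoding to an 8x8 RGB image.
NUM_IMAGE_TOKENS, NUM_TEXT_TOKENS, DIM = 32, 77, 64
H = W = 8

def text_aware_detokenize(image_tokens, text_embeds, proj):
    """Toy text-aware de-tokenization: the decoder input is the
    1-D image tokens concatenated with the text embeddings, so the
    reconstruction is conditioned on the caption. A real decoder
    would be a transformer; here a mean-pool + linear map stands in."""
    seq = np.concatenate([image_tokens, text_embeds], axis=0)  # (32+77, 64)
    pooled = seq.mean(axis=0)                                  # stand-in for attention
    return (pooled @ proj).reshape(H, W, 3)                    # "pixels"

image_tokens = rng.normal(size=(NUM_IMAGE_TOKENS, DIM))
text_embeds = rng.normal(size=(NUM_TEXT_TOKENS, DIM))
proj = rng.normal(size=(DIM, H * W * 3))

img = text_aware_detokenize(image_tokens, text_embeds, proj)
print(img.shape)  # (8, 8, 3)
```

The design point this illustrates: because the de-tokenizer is conditioned on text, the 1D latent does not have to encode semantics the caption already carries, which is what lets the tokenizer converge faster.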
📝 Abstract
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
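For readers unfamiliar with masked generative models, the sampling procedure MaskGen belongs to can be sketched in a MaskGIT-style loop (this is the general technique, not MaskGen's exact schedule or predictor; the cosine schedule, sequence length, and toy predictor below are assumptions for illustration): start from a fully masked token sequence, and at each step commit the most confident predictions while re-masking the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, VOCAB, STEPS = 32, 128, 8
MASK = -1  # sentinel for a masked position

def masked_generation(predict_fn, steps=STEPS):
    """MaskGIT-style iterative decoding: begin fully masked, then at
    each step keep the highest-confidence predictions and re-mask the
    remainder according to a cosine schedule until nothing is masked."""
    tokens = np.full(SEQ_LEN, MASK)
    for t in range(steps):
        probs = predict_fn(tokens)                   # (SEQ_LEN, VOCAB)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        conf[tokens != MASK] = np.inf                # never re-mask committed tokens
        tokens = np.where(tokens == MASK, pred, tokens)
        # cosine schedule: fraction of positions still masked after this step
        frac = np.cos(np.pi / 2 * (t + 1) / steps)
        n_mask = int(SEQ_LEN * frac)
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK
    return tokens

def toy_predictor(tokens):
    # Placeholder for the masked transformer: random softmax outputs.
    logits = rng.normal(size=(SEQ_LEN, VOCAB))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

out = masked_generation(toy_predictor)
print((out == MASK).sum())  # 0: every token committed after the final step
```

In the real pipeline, the committed 1D tokens would then be passed through the TA-TiTok de-tokenizer (together with the text embeddings) to produce the image; here the predictor is random, so only the control flow is meaningful.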