Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

📅 2024-09-06
🏛️ arXiv.org
📈 Citations: 20
Influential: 3
🤖 AI Summary
To address the challenge of auto-regressive visual generation under an ultra-large vocabulary (2¹⁸ codes), this work introduces an open-source family of auto-regressive image generation models (300M–1.5B parameters). The approach builds on an open-source replication of the MAGVIT-v2 tokenizer and proposes an asymmetric token factorization mechanism together with a "next sub-token prediction" paradigm, designed to strengthen sub-token interaction and generation quality at extreme vocabulary scales. Evaluated on ImageNet 256×256, the tokenizer achieves 1.17 rFID, state-of-the-art reconstruction performance. All models, training code, and the tokenizer are fully open-sourced, establishing a reproducible, scalable baseline for research in auto-regressive visual generation.

📝 Abstract
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and codes to foster innovation and creativity in the field of auto-regressive visual generation.
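The asymmetric factorization described above can be sketched in a few lines: a single id from the $2^{18}$-code vocabulary is split into two smaller sub-token ids drawn from sub-vocabularies of unequal size. This is a minimal illustrative sketch, not the paper's implementation; the split into $2^6$ and $2^{12}$ below is an assumption chosen so the product matches $2^{18}$, and the exact sizes used by Open-MAGVIT2 may differ.

```python
# Illustrative sketch of asymmetric token factorization.
# Assumed split: 2**6 x 2**12 = 2**18 (sizes chosen for illustration only).
V1, V2 = 2**6, 2**12  # two sub-vocabularies of different sizes

def factorize(token_id: int) -> tuple[int, int]:
    """Split one super-large-vocabulary id into two sub-token ids."""
    assert 0 <= token_id < V1 * V2
    return token_id // V2, token_id % V2

def compose(sub1: int, sub2: int) -> int:
    """Recombine the two sub-token ids into the original id."""
    return sub1 * V2 + sub2
```

The point of the factorization is that each prediction head only has to normalize over $2^6$ or $2^{12}$ classes instead of a single $2^{18}$-way softmax, while the mapping remains lossless: `compose(*factorize(t)) == t` for any valid id.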
Problem

Research questions and friction points this paper is trying to address.

Developing open-source auto-regressive image generation models
Enhancing reconstruction performance with large codebook tokenizer
Improving generation quality through sub-token prediction techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source replication of MAGVIT-v2 tokenizer
Asymmetric token factorization for large vocabulary
Next sub-token prediction enhances generation quality
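The "next sub-token prediction" idea in the list above can be sketched as a two-stage greedy decode: predict the first sub-token from the hidden state, then condition on it (here, via an additive embedding) before predicting the second. All names, toy dimensions, and the additive conditioning are assumptions for illustration; they are not the paper's architecture.

```python
import numpy as np

# Toy sizes for illustration; the paper's sub-vocabularies are far larger.
rng = np.random.default_rng(0)
D, V1, V2 = 8, 4, 16

W1 = rng.standard_normal((D, V1))  # head over the first sub-vocabulary
E1 = rng.standard_normal((V1, D))  # embedding table for the first sub-token
W2 = rng.standard_normal((D, V2))  # head over the second sub-vocabulary

def predict_subtokens(h: np.ndarray) -> tuple[int, int]:
    """Greedy next-sub-token prediction for one hidden state h of shape (D,)."""
    s1 = int(np.argmax(h @ W1))  # predict the first sub-token
    h2 = h + E1[s1]              # condition on it before the second prediction
    s2 = int(np.argmax(h2 @ W2)) # then predict the second sub-token
    return s1, s2
```

The design choice this sketch highlights: because the second prediction sees the first sub-token, the two halves of each factorized code interact rather than being predicted independently, which is the stated motivation for better generation quality.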