High-Resolution Image Synthesis via Next-Token Prediction

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)

🤖 AI Summary
This work targets three obstacles to high-resolution (up to 4K) text-to-image generation: the difficulty of autoregressive modeling at that scale, low detail fidelity, and poor cross-scale consistency. It proposes D-JEPA·T2I, an autoregressive model that applies next-token prediction over continuous tokens to text-to-image synthesis. The model builds on the denoising joint embedding predictive architecture (D-JEPA) and uses a multimodal visual transformer to integrate textual and visual features. A flow matching loss, combined with Visual Rotary Positional Embedding (VoPE), enables continuous resolution learning. A data feedback mechanism adjusts the training-data sampling procedure using statistical analysis and an online-learned critic model. The framework supports unconditional, class-conditional, and text-conditional generation at arbitrary resolutions. The authors report state-of-the-art high-resolution image synthesis via next-token prediction, with improvements in photorealism, local detail fidelity, and cross-resolution consistency.

📝 Abstract
Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback mechanism that dynamically adjusts the sampling procedure based on statistical analysis and an online learning critic model. This encourages the model to move beyond its comfort zone, reducing redundant training on well-mastered scenarios and compelling it to address more challenging cases with suboptimal generation quality. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
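
The data feedback idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the bucket names, the score range, the `floor` parameter, and both function names are assumptions made for illustration. It assumes a critic that assigns each data bucket a quality score in [0, 1], where 1 means the model already generates that bucket well, and it samples harder buckets more often.

```python
import random

def feedback_weights(critic_scores, floor=0.05):
    """Map per-bucket critic scores (assumed in [0, 1], 1 = well mastered)
    to sampling probabilities that favor poorly generated buckets.
    `floor` (hypothetical) keeps mastered buckets from vanishing entirely."""
    raw = [max(1.0 - s, floor) for s in critic_scores]
    total = sum(raw)
    return [w / total for w in raw]

def sample_buckets(bucket_names, critic_scores, k, seed=0):
    """Draw k training buckets according to the feedback weights."""
    rng = random.Random(seed)
    weights = feedback_weights(critic_scores)
    return rng.choices(bucket_names, weights=weights, k=k)

# Hypothetical buckets and critic scores, for illustration only.
buckets = ["portraits", "landscapes", "dense_text"]
scores = [0.9, 0.7, 0.2]

weights = feedback_weights(scores)
batch = sample_buckets(buckets, scores, k=1000)
```

In a training loop the scores would be refreshed as the critic learns online, so the weights shift as the model improves on a bucket.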
Problem

Research questions and friction points this paper is trying to address.

Explores next-token prediction for high-resolution text-to-image generation.
Introduces D-JEPA·T2I for photorealistic images up to 4K resolution.
Proposes innovative architecture and training strategies for continuous resolution learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive model with continuous tokens
Denoising joint embedding predictive architecture
Flow matching loss with Visual Rotary Positional Embedding
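
As a rough illustration of a flow matching loss on a single continuous token, the sketch below uses a generic formulation with a straight-line path between Gaussian noise and the data token. The paper's exact probability path, loss weighting, and conditioning are not stated on this page, so the path choice and the names `flow_matching_loss` and `velocity_fn` are assumptions.

```python
import random

def flow_matching_loss(velocity_fn, x1, rng):
    """One-sample flow matching loss for one continuous token x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose target velocity is x1 - x0. velocity_fn(x_t, t) is the model."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # noise sample
    t = rng.random()                            # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]    # velocity along the path
    pred = velocity_fn(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

token = [0.5, -1.0, 2.0]  # hypothetical 3-dimensional token

# A model that always predicts zero velocity incurs a positive loss.
zero_model = lambda xt, t: [0.0] * len(xt)
loss_zero = flow_matching_loss(zero_model, token, random.Random(0))

# An oracle that knows the token recovers the true velocity (x1 - x_t) / (1 - t).
oracle = lambda xt, t: [(b - x) / (1.0 - t) for x, b in zip(xt, token)]
loss_oracle = flow_matching_loss(oracle, token, random.Random(0))
```

The oracle exists only to show that the target is consistent with the path. In practice `velocity_fn` would be a learned network conditioned on the autoregressive context.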