🤖 AI Summary
This work addresses key challenges in high-resolution (up to 4K) text-to-image generation: the difficulty of autoregressive modeling at this scale, low detail fidelity, and poor cross-scale consistency. To this end, we propose D-JEPA·T2I, the first architecture to extend next-token prediction over continuous tokens to text-to-image synthesis. It introduces Visual Rotary Positional Embedding (VoPE) to enhance spatial modeling, adopts a flow matching loss to support learning across continuous resolutions, and proposes a data feedback training mechanism that dynamically adjusts data sampling using statistical analysis and an online-learned critic model. Built upon a multimodal visual transformer, the framework enables joint embedding prediction and supports unconditional, class-conditional, and text-conditional generation at arbitrary resolutions. Experiments demonstrate state-of-the-art performance on 4K text-to-image synthesis, with significant improvements in photorealism, local detail fidelity, and cross-resolution consistency.
📝 Abstract
Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback mechanism that dynamically adjusts the sampling procedure based on statistical analysis and an online learning critic model. This encourages the model to move beyond its comfort zone, reducing redundant training on well-mastered scenarios and compelling it to address more challenging cases with suboptimal generation quality. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.
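The abstract names a flow matching loss but does not give its form. As a rough illustration only, the sketch below uses a common straight-line (rectified-flow style) formulation, which is an assumption and not necessarily the variant used in D-JEPA·T2I: a continuous token is interpolated with Gaussian noise, and a network would be trained to regress the constant velocity along that path. The function names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build the interpolated input x_t and the target velocity for one token.

    x1:  list of floats, the clean continuous token (data sample).
    t:   float in [0, 1], the interpolation time (0 = noise, 1 = data).
    rng: random.Random instance used to draw the Gaussian noise x0.

    Assumption: a straight-line path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    v_target = [b - a for a, b in zip(x0, x1)]             # velocity d x_t / d t
    return xt, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


# Usage: in training, v_pred would come from a network conditioned on x_t, t,
# and the context features; here the target itself stands in for a perfect
# prediction, so the loss is zero.
rng = random.Random(0)
xt, v_target = flow_matching_pair([0.5, -1.0, 2.0], t=0.3, rng=rng)
loss = flow_matching_loss(v_target, v_target)
```

In a setup like the one the abstract describes, this per-token loss would presumably be applied to each predicted continuous token, with the time `t` sampled at random for every training step.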