🤖 AI Summary
Diffusion Transformers face severe coupled failures in native 4K multi-aspect-ratio text-to-image generation, including positional encoding mismatch, VAE compression artifacts, and training instability. To address these challenges, we propose a data-model co-design framework: (1) constructing the MultiAspect-4K-1M multilingual 4K dataset; (2) introducing aspect-ratio-aware Resonance 2D RoPE with YaRN positional encoding; (3) adopting non-adversarial VAE post-training with an SNR-Aware Huber Wavelet loss; and (4) designing a staged aesthetic curriculum learning strategy. Our approach significantly improves detail fidelity and training stability across diverse aspect ratios (landscape, square, portrait). At 4096 resolution, the model matches or surpasses Seedream 4.0 in fidelity, aesthetic quality, and text alignment. This work delivers the first open-source, fully reproducible systematic solution for ultra-high-resolution text-to-image synthesis.
📝 Abstract
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and in multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
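To give a concrete sense of component (i), here is a minimal sketch of YaRN-style frequency scaling for one axis of a 2D RoPE. It assumes the standard RoPE frequency schedule and the YaRN idea of keeping fast-rotating (high-frequency) bands while position-interpolating slow ones, with a linear ramp in between. The window lengths (64 latent tokens per side at 1K, 256 at 4K), the ramp thresholds `beta_fast`/`beta_slow`, and the function name are illustrative assumptions, not the paper's exact Resonance 2D RoPE formulation.

```python
import numpy as np

def yarn_scaled_freqs(head_dim, train_len=64, target_len=256,
                      base=10000.0, beta_fast=8.0, beta_slow=1.0):
    """Sketch of YaRN-style per-band RoPE scaling for window extension.

    All hyperparameters here are illustrative assumptions.
    """
    scale = target_len / train_len
    # Standard RoPE frequencies for each (sin, cos) pair.
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # How many full rotations each band completes over the training window.
    rotations = train_len * freqs / (2 * np.pi)
    # Ramp: 1 where the band rotates fast (keep as-is / extrapolate),
    # 0 where it rotates slowly (interpolate positions by 1/scale).
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return freqs * (ramp + (1.0 - ramp) / scale)
```

Fast bands keep their training-time frequency (so local detail phases are unchanged), while the slowest bands are divided by the 4x window scale so absolute positions at 4K stay inside the range seen during 1K training.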
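For component (iii), the abstract only names the SNR-Aware Huber Wavelet objective. The sketch below shows one plausible shape for such a loss: a Huber penalty applied per Haar-wavelet subband, with high-frequency bands up-weighted and the whole term rebalanced by a clipped SNR weight. The Haar transform, the band weights, and the min-SNR-style clipping are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform: (LL, LH, HL, HH) subbands of a 2D array."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, h, v, d

def huber(err, delta=1.0):
    """Elementwise Huber: 0.5*e^2 inside |e|<=delta, linear beyond."""
    abs_e = np.abs(err)
    quad = np.minimum(abs_e, delta)
    return quad * (abs_e - 0.5 * quad)

def snr_aware_huber_wavelet(pred, target, snr, hf_weight=2.0, delta=1.0):
    """Hypothetical loss: Huber per Haar subband, HF bands up-weighted,
    scaled by a clipped SNR weight (all choices here are assumptions)."""
    loss = 0.0
    for i, (p, t) in enumerate(zip(haar_dwt2(pred), haar_dwt2(target))):
        w = 1.0 if i == 0 else hf_weight  # up-weight LH/HL/HH detail bands
        loss += w * huber(p - t, delta).mean()
    # Min-SNR-style rebalancing: cap the timestep weight at gamma=5 (assumption).
    return loss * min(snr, 5.0) / 5.0
```

The intended effect mirrors the abstract's claim: the subband weights rebalance gradients across frequency bands, while the SNR factor rebalances them across diffusion timesteps.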