🤖 AI Summary
Diffusion Transformers face severe coupled failures in native 4K multi-aspect-ratio text-to-image generation, including positional encoding mismatch, VAE compression artifacts, and training instability. To address these challenges, we propose a data-model co-design framework: (1) constructing the MultiAspect-4K-1M multilingual 4K dataset; (2) introducing aspect-ratio-aware Resonance 2D RoPE with YaRN positional encoding; (3) adopting non-adversarial VAE post-training with an SNR-Aware Huber Wavelet loss; and (4) designing a staged aesthetic curriculum learning strategy. Our approach significantly improves detail fidelity and training stability across diverse aspect ratios (landscape, square, portrait). At 4096 resolution, the model matches or surpasses Seedream 4.0 in fidelity, aesthetic quality, and text alignment. This work delivers the first open-source, fully reproducible systematic solution for ultra-high-resolution text-to-image synthesis.
📝 Abstract
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and in multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
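To give a concrete sense of component (i), here is a minimal sketch of YaRN-style frequency scaling for one axis of a 2D RoPE. It assumes the standard RoPE frequency schedule and the YaRN idea of keeping fast-rotating (high-frequency) bands while position-interpolating slow ones, with a linear ramp in between. The window lengths (64 latent tokens per side at 1K, 256 at 4K), the ramp thresholds `beta_fast`/`beta_slow`, and the function name are illustrative assumptions, not the paper's exact Resonance 2D RoPE formulation.

```python
import numpy as np

def yarn_scaled_freqs(head_dim, train_len=64, target_len=256,
                      base=10000.0, beta_fast=8.0, beta_slow=1.0):
    """Sketch of YaRN-style per-band RoPE scaling for window extension.

    All hyperparameters here are illustrative assumptions.
    """
    scale = target_len / train_len
    # Standard RoPE frequencies for each (sin, cos) pair.
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # How many full rotations each band completes over the training window.
    rotations = train_len * freqs / (2 * np.pi)
    # Ramp: 1 where the band rotates fast (keep as-is / extrapolate),
    # 0 where it rotates slowly (interpolate positions by 1/scale).
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return freqs * (ramp + (1.0 - ramp) / scale)
```

Fast bands keep their training-time frequency (so local detail phases are unchanged), while the slowest bands are divided by the 4x window scale so absolute positions at 4K stay inside the range seen during 1K training.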
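For component (iii), the abstract only names the SNR-Aware Huber Wavelet objective. The sketch below shows one plausible shape for such a loss: a Huber penalty applied per Haar-wavelet subband, with high-frequency bands up-weighted and the whole term rebalanced by a clipped SNR weight. The Haar transform, the band weights, and the min-SNR-style clipping are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform: (LL, LH, HL, HH) subbands of a 2D array."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, h, v, d

def huber(err, delta=1.0):
    """Elementwise Huber: 0.5*e^2 inside |e|<=delta, linear beyond."""
    abs_e = np.abs(err)
    quad = np.minimum(abs_e, delta)
    return quad * (abs_e - 0.5 * quad)

def snr_aware_huber_wavelet(pred, target, snr, hf_weight=2.0, delta=1.0):
    """Hypothetical loss: Huber per Haar subband, HF bands up-weighted,
    scaled by a clipped SNR weight (all choices here are assumptions)."""
    loss = 0.0
    for i, (p, t) in enumerate(zip(haar_dwt2(pred), haar_dwt2(target))):
        w = 1.0 if i == 0 else hf_weight  # up-weight LH/HL/HH detail bands
        loss += w * huber(p - t, delta).mean()
    # Min-SNR-style rebalancing: cap the timestep weight at gamma=5 (assumption).
    return loss * min(snr, 5.0) / 5.0
```

The intended effect mirrors the abstract's claim: the subband weights rebalance gradients across frequency bands, while the SNR factor rebalances them across diffusion timesteps.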