Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

📅 2025-03-24
🤖 AI Summary
Current text-to-image diffusion models struggle to generate high-quality 4K images directly, in part because there is no publicly available 4K image synthesis dataset and no established way to evaluate fine detail at that resolution. To address this, we propose Diffusion-4K, a framework for direct 4K image generation. Our key contributions are: (1) Aesthetic-4K, a 4K benchmark built from carefully selected images with captions generated by GPT-4o; (2) two detail-oriented metrics, GLCM Score and Compression Ratio, used alongside holistic measures (FID, Aesthetics, and CLIPScore); and (3) a wavelet-based fine-tuning method for training directly on photorealistic 4K images, applicable to various latent diffusion models. With SD3-2B and Flux-12B backbones, experiments on Aesthetic-4K show strong results in fine-detail synthesis and text prompt adherence.

📝 Abstract
In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
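The abstract names Compression Ratio as a fine-detail metric but does not define it here. A minimal sketch of one plausible reading follows: raw byte size divided by compressed size. The codec is an assumption (zlib is used only because it ships with Python; the paper's choice is not stated in this summary), and `compression_ratio` is a hypothetical helper name.

```python
import random
import zlib


def compression_ratio(pixels: bytes) -> float:
    """Raw size divided by zlib-compressed size (zlib is a stand-in codec)."""
    if not pixels:
        raise ValueError("pixels must be non-empty")
    return len(pixels) / len(zlib.compress(pixels, 9))


# Two synthetic 64x64 grayscale images, stored as raw bytes.
flat = bytes([128]) * (64 * 64)  # uniform gray, no texture
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(64 * 64))  # dense random detail

# The uniform image compresses far better than the detailed one.
print(compression_ratio(flat) > compression_ratio(noisy))
```

Under this reading, an image with little texture yields a high ratio and a densely detailed one yields a ratio near 1, which is why the quantity can serve as a rough proxy for the amount of fine detail.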
Problem

Research questions and friction points this paper is trying to address.

Lack of public 4K image synthesis dataset
Need for evaluating ultra-high-resolution image details
Need for a method to train directly on 4K images
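For the detail-evaluation point above, it may help to recall what a gray-level co-occurrence matrix (GLCM) measures. The sketch below builds a classical GLCM for one pixel offset and computes the standard contrast statistic. It is not the paper's GLCM Score, whose exact definition is not given in this summary; the function names and the tiny test images are illustrative assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    total = 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Classical GLCM contrast: sum over i, j of p[i][j] * (i - j)**2."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# 4x4 images with gray levels in {0, 1, 2, 3}.
smooth = [[0, 0, 1, 1] for _ in range(4)]  # one soft edge per row
checker = [[((x + y) % 2) * 3 for x in range(4)] for y in range(4)]  # 0/3 alternation

print(contrast(glcm(smooth, 4)) < contrast(glcm(checker, 4)))
```

The checkerboard, where every horizontal neighbor pair jumps between the extreme gray levels, scores much higher contrast than the smooth image, which is the kind of local-texture sensitivity that holistic metrics such as FID do not isolate.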
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed Aesthetic-4K benchmark for 4K synthesis
Proposed wavelet-based fine-tuning for 4K images
Achieved high-quality 4K synthesis and text prompt adherence with SD3-2B and Flux-12B
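To make the wavelet idea concrete, here is a minimal sketch of a one-level 2D wavelet decomposition, which splits an image (or latent) into a coarse approximation and three detail bands. The Haar wavelet is an assumption chosen for simplicity; this summary does not say which wavelet the paper uses or how the bands enter the training objective, and `haar_dwt2` is a hypothetical helper name.

```python
def haar_dwt2(img):
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Returns four half-resolution bands: a coarse approximation followed by
    horizontal-, vertical-, and diagonal-difference detail bands.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = (
        [[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)
    )
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            i, j = y // 2, x // 2
            approx[i][j] = (a + b + c + d) / 2  # low-pass in both directions
            horiz[i][j] = (a - b + c - d) / 2   # left-minus-right difference
            vert[i][j] = (a + b - c - d) / 2    # top-minus-bottom difference
            diag[i][j] = (a - b - c + d) / 2    # diagonal difference
    return approx, horiz, vert, diag


flat = [[1.0, 1.0], [1.0, 1.0]]      # no detail
textured = [[1.0, 0.0], [0.0, 1.0]]  # pure diagonal detail

print(haar_dwt2(flat))
print(haar_dwt2(textured))
```

A flat patch puts all of its energy in the approximation band, while a textured patch moves energy into the detail bands. Separating the two is what lets a wavelet-based objective treat high-frequency content explicitly.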
Jinjin Zhang
Beihang University
Qiuyu Huang
Meituan
Junjie Liu
Meituan
Xiefan Guo
Beihang University
Generative AI
Di Huang
State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China; School of Computer Science and Engineering, Beihang University, Beijing 100191, China