Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Ultra-high-definition (4K) image synthesis has long been hindered by the absence of standardized benchmarks and computational bottlenecks. To address this, we introduce Aesthetic-4K—the first standardized dataset specifically designed for 4K synthesis—and propose Diffusion-4K, an end-to-end high-fidelity generative framework. Our key contributions include: (1) a novel Scale-Consistent VAE jointly optimized with Wavelet-based Latent Fine-tuning (WLF) to balance reconstruction fidelity and inference efficiency; (2) two new evaluation metrics—GLCM Score and Compression Ratio—that jointly quantify texture detail preservation and global structural fidelity; and (3) integration of GPT-4o–assisted automatic annotation, multi-scale assessment (GLCM/CLIPScore/FID), and compatibility with large foundation models (e.g., Flux-12B). Experiments demonstrate a 32% reduction in FID and a 41% improvement in GLCM Score over state-of-the-art methods. Both code and dataset are publicly released.

Technology Category

Application Category

📝 Abstract

Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (eg, Flux-12B). The source code is publicly available at https://github.com/zhang0jhon/diffusion-4k.

Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmarks for ultra-high-resolution image synthesis

Computational constraints in generating ultra-high-resolution images

Need for comprehensive evaluation metrics for ultra-high-resolution images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aesthetic-4K dataset with GPT-4o captions

Diffusion-4K framework with SC-VAE and WLF

GLCM Score and Compression Ratio metrics

🔎 Similar Papers

No similar papers found.