EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

📅 2025-05-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of effective semantic priors in blind super-resolution (BSR). To this end, it introduces pre-trained text-to-image diffusion models, specifically Diffusion Transformers (DiT), into BSR for the first time. The proposed Ψ-DiT block forms a triple-flow collaborative network, incorporating a separable flow injection mechanism to enable low-overhead feature fusion. A progressive masked image modeling strategy is further adopted to reduce training cost. Additionally, the paper proposes a subject-aware prompt generation method that enables context-driven, fine-grained semantic guidance. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with notable gains in PSNR and SSIM, particularly in recovering complex textures and structural details.

📝 Abstract
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, Ψ-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
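The paper does not spell out its progressive Masked Image Modeling schedule on this page; as a loose illustration under assumed specifics (a linear mask-ratio anneal and random patch masking over token-major latents, with all names hypothetical), the idea might look like:

```python
import numpy as np

def mask_ratio(step: int, total_steps: int, start: float = 0.75, end: float = 0.0) -> float:
    """Linearly anneal the fraction of masked latent patches as training progresses."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t

def apply_patch_mask(latent: np.ndarray, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Zero out a random subset of patch tokens; latent has shape [num_patches, dim]."""
    n = latent.shape[0]
    k = int(round(n * ratio))
    masked = latent.copy()
    idx = rng.choice(n, size=k, replace=False)
    masked[idx] = 0.0
    return masked

# Early in training most patches are hidden (a harder, cheaper task);
# by the end, no patches are masked and the model sees full latents.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))
early = apply_patch_mask(tokens, mask_ratio(0, 1000), rng)    # 75% of rows zeroed
late = apply_patch_mask(tokens, mask_ratio(1000, 1000), rng)  # nothing masked
```

Any schedule shape (cosine, stepwise) would fit the same interface; the linear anneal above is only one plausible choice.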
Problem

Research questions and friction points this paper is trying to address.

Blind super-resolution lacks effective semantic priors to guide faithful restoration
Existing T2I-guided BSR methods rely on U-Net backbones and do not exploit the stronger DiT architecture
Generic, image-level prompts underutilize T2I diffusion priors for fine-grained semantic guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a pre-trained Diffusion Transformer for Blind Super-Resolution for the first time
Introduces the Ψ-DiT block, a triple-flow architecture with separable flow injection of the low-resolution latent
Adopts a progressive Masked Image Modeling strategy that improves generalization and reduces training cost
Generates subject-aware prompts with a multi-modal model in an in-context learning framework
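The internals of the separable flow injection are not given on this page; as a hypothetical sketch of the general idea, assuming a ControlNet-style zero-initialized projection so the frozen pre-trained DiT prior is untouched at the start of training (class and method names are invented for illustration):

```python
import numpy as np

class SeparableFlowInjection:
    """Sketch: add a projected low-resolution latent stream onto a DiT block's
    hidden states. The projection starts at zero, so at initialization the
    pre-trained model's behavior is exactly preserved; training gradually
    learns how much of the LR control signal to inject."""

    def __init__(self, dim: int):
        self.proj = np.zeros((dim, dim))  # zero-init: injection is a no-op at step 0

    def __call__(self, hidden: np.ndarray, lr_latent: np.ndarray) -> np.ndarray:
        # hidden, lr_latent: [num_tokens, dim]
        return hidden + lr_latent @ self.proj

inj = SeparableFlowInjection(dim=8)
hidden = np.ones((4, 8))
lr_latent = np.full((4, 8), 3.0)
out = inj(hidden, lr_latent)  # identical to hidden while proj is all zeros
```

The zero-initialization trick is a common way to graft a control branch onto a frozen generative backbone without disturbing it at the start of fine-tuning; whether EAM uses exactly this mechanism is an assumption here.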
👥 Authors
Haizhen Xie
Huawei Noah’s Ark Lab
Kunpeng Du
Huawei Noah’s Ark Lab
Qi Yan
PhD Student, University of British Columbia
machine learning, robotics
Sen Lu
Huawei Noah’s Ark Lab
Jianhong Han
Huawei Noah’s Ark Lab
Hanting Chen
Noah's Ark Lab, Huawei
deep learning, machine learning, computer vision
Hailin Hu
Huawei Noah's Ark Lab
Jie Hu
Huawei Noah’s Ark Lab