GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing diffusion-based super-resolution methods: they rely on text conditioning, which carries only high-level semantics and cannot preserve spatial alignment or fine-grained detail from the low-resolution input. The authors propose GramSR, a one-step diffusion framework that replaces textual prompts with dense visual features extracted from the low-resolution input by a pre-trained DINOv3 encoder, giving a spatially aligned conditioning signal. GramSR trains three LoRA modules sequentially, targeting pixel-level, semantic-level, and texture-level reconstruction respectively. The pixel-level module uses an $\ell_2$ loss, the semantic-level module uses LPIPS and CSD losses, and the texture-level module uses a Gram matrix loss computed from DINOv3 features. On standard SR benchmarks, the method outperforms existing one-step diffusion-based approaches, with better structural fidelity and texture realism.

📝 Abstract

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using $\ell_2$ loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: https://github.com/aimagelab/GramSR.
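The texture-level objective in the abstract compares correlations between feature channels rather than the features themselves. A minimal sketch of such a Gram matrix loss follows, in plain Python with no dependencies. It is not taken from the GramSR code: the function names `gram_matrix` and `gram_loss`, the normalization by token count, and the mean-squared distance are assumptions made for illustration, and the small lists stand in for DINOv3 patch features.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of N token vectors of length C.

    Entry (i, j) is the average over tokens of channel_i * channel_j.
    Dividing by N is an assumed normalization, not confirmed by the source.
    """
    n = len(features)
    c = len(features[0])
    gram = [[0.0] * c for _ in range(c)]
    for token in features:
        for i in range(c):
            for j in range(c):
                gram[i][j] += token[i] * token[j]
    return [[value / n for value in row] for row in gram]


def gram_loss(feats_sr, feats_hr):
    """Mean squared difference between the two Gram matrices (assumed distance)."""
    g_sr = gram_matrix(feats_sr)
    g_hr = gram_matrix(feats_hr)
    c = len(g_sr)
    total = sum(
        (g_sr[i][j] - g_hr[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)


# Toy stand-ins for patch features: 2 tokens, 2 channels each.
feats_hr = [[1.0, 2.0], [3.0, 4.0]]
feats_sr_shuffled = [[3.0, 4.0], [1.0, 2.0]]  # same tokens, different order
feats_sr_other = [[1.0, 0.0], [0.0, 1.0]]     # different channel statistics

loss_shuffled = gram_loss(feats_sr_shuffled, feats_hr)
loss_other = gram_loss(feats_sr_other, feats_hr)
```

Because the Gram matrix sums over token positions, reordering the tokens leaves the loss unchanged (`loss_shuffled` is zero here), while a change in channel co-activation statistics is penalized. This is why a loss of this form constrains texture rather than spatial layout.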

Problem

Research questions and friction points this paper is trying to address.

super-resolution

diffusion models

visual feature conditioning

representation gap

real-world degradations

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based super-resolution

visual feature conditioning

DINOv3