Progressive Image Restoration via Text-Conditioned Video Generation

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

career value

178K/year

🤖 AI Summary

Traditional image restoration methods for super-resolution, deblurring, and low-light enhancement overlook the untapped potential of text-conditioned video generation. Method: This work pioneers the integration of text-to-video (T2V) models—specifically CogVideo—into image restoration via a progressive refinement framework. It fine-tunes CogVideo to learn a temporal degradation-to-restoration trajectory, introduces a scene-aware intelligent prompting strategy leveraging LLaVA and ChatGPT for enhanced semantic coherence and interpretability, and constructs a synthetic dataset optimized jointly on PSNR, SSIM, and LPIPS. Contribution/Results: Experiments demonstrate significant improvements over state-of-the-art methods in spatial detail reconstruction, illumination consistency, and temporal coherence. Notably, the approach achieves zero-shot generalization on the ReLoBlur dataset, confirming strong robustness and cross-task transferability.

Technology Category

Application Category

📝 Abstract

Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.

Problem

Research questions and friction points this paper is trying to address.

Repurposes video generation models for progressive image restoration tasks

Generates restoration trajectories from degraded to clean frames

Enhances spatial detail and illumination with temporal coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned CogVideo for progressive restoration trajectories

Used synthetic datasets for super-resolution, deblurring, enhancement

Compared uniform and scene-specific prompting via LLaVA and ChatGPT

🔎 Similar Papers

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion