Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously achieving photorealism and structural consistency in zero-shot synthetic video generation. We propose a fine-tuning-free, structure-aware re-rendering framework that leverages pre-trained diffusion-based video models conditioned on multi-level structural priors, including depth, semantic, and edge maps, to jointly enforce geometric, semantic, and boundary consistency across both spatial and temporal domains during denoising. The key innovation is the direct integration of structural signals, derived from lightweight auxiliary models, into the diffusion process, enabling explicit cross-layer and inter-frame structural guidance. Experiments demonstrate that the method preserves state-of-the-art visual realism while significantly outperforming existing zero-shot re-rendering approaches in structural fidelity, particularly in motion coherence and physically plausible object deformation. The framework thus establishes an efficient and reliable generative paradigm for open-set video editing.

📝 Abstract
We propose an approach to enhancing synthetic video realism that re-renders synthetic videos from a simulator in a photorealistic fashion. Our realism-enhancement approach is a zero-shot framework, built on a video diffusion foundation model without further fine-tuning, that preserves the multi-level structures of the synthetic video in the enhanced one across both spatial and temporal domains. Specifically, we introduce a simple but effective modification that conditions the generation/denoising process on structure-aware information, such as depth maps, semantic maps, and edge maps, estimated from the synthetic video by an auxiliary model rather than extracted from the simulator. This guidance ensures that the enhanced video remains consistent with the original synthetic video at both the structural and semantic levels. The approach is simple yet general and powerful: in our experiments it outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism.
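The core mechanism described above, steering each denoising step so the output stays faithful to structure maps estimated from the synthetic input, can be illustrated with a toy sketch. Everything below is hypothetical: a box blur stands in for the pretrained video diffusion denoiser, a high-pass residual stands in for the estimated edge/structure map, and the guidance rule is an illustrative nudge, not the paper's actual conditioning.

```python
import numpy as np

def blur(img):
    # 4-neighbor box blur: a cheap stand-in for one denoiser update
    return 0.25 * (np.roll(img, 1, 0) + np.roll(img, -1, 0)
                   + np.roll(img, 1, 1) + np.roll(img, -1, 1))

def highpass(img):
    # high-frequency residual: a crude proxy for an edge/structure map
    return img - blur(img)

def structure_guided_denoise(x, structure_ref, steps=50, guidance=0.5):
    # Toy loop: smoothing plays the role of the pretrained denoiser;
    # after each step, nudge the high-frequency content of the estimate
    # toward the structural reference taken from the synthetic frame.
    for _ in range(steps):
        x = blur(x)                                        # "denoise"
        x = x + guidance * (structure_ref - highpass(x))   # structural guidance
    return x
```

In this sketch the guidance term drives `highpass(x)` toward `structure_ref`, so the cleaned-up frame keeps the original boundaries even though the "denoiser" alone would smooth them away, which mirrors the paper's goal of structural consistency without fine-tuning.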
Problem

Research questions and friction points this paper is trying to address.

Enhancing synthetic video realism with diffusion models, without fine-tuning
Preserving structural consistency through depth, semantic, and edge maps
Achieving photorealistic rendering while maintaining the original video's structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot framework preserves multi-level video structures
Structure-aware denoising conditioned on depth, semantic, and edge maps
Pre-trained video diffusion model reused for photorealism without fine-tuning
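Since the method conditions on several structural priors at once (depth, semantic, and edge maps), the per-prior mismatch signals must be fused into a single correction. The paper does not publish a specific fusion rule, so the sketch below shows one plausible choice, a weighted sum, with entirely hypothetical weights.

```python
import numpy as np

def combine_structural_guidance(residuals, weights):
    # Fuse per-prior mismatch maps (e.g. "depth", "semantic", "edge")
    # into one correction term via a weighted sum. Both the fusion rule
    # and the weights are illustrative assumptions, not the paper's.
    total = np.zeros_like(next(iter(residuals.values())))
    for name, r in residuals.items():
        total = total + weights.get(name, 0.0) * r
    return total
```

Priors missing from the weight dictionary simply contribute nothing, so the same code covers ablations where only a subset of the structure maps is available.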