Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

📅 2025-05-25

📈 Citations: 0

✨ Influential: 0

career value

211K/year

🤖 AI Summary

To address the trade-off between high computational cost in dense multi-view video novel view synthesis and quality degradation (e.g., flickering, geometric distortion, and spatiotemporal inconsistency) under sparse input conditions, this paper proposes an efficient online 3D video synthesis method. Our approach introduces two key innovations: (1) a globally geometry-constrained real-time rendering paradigm, leveraging TSDF-based spatial modeling to enforce cross-view and inter-frame geometric consistency; and (2) a temporal color-difference masking guided progressive depth map optimization scheme, jointly enhanced by a pre-trained multi-view fusion network to improve photometric fidelity. The method retains online inference capability (≈30 FPS) while significantly suppressing flickering artifacts. It achieves state-of-the-art synthesis quality across multiple benchmarks, demonstrating superior geometric stability and visual coherence under sparse input settings.

Technology Category

Application Category

📝 Abstract

We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.

Problem

Research questions and friction points this paper is trying to address.

Achieving efficient high-quality novel-view video synthesis

Reducing multi-view and temporal inconsistencies in synthesis

Enhancing view and temporal consistency using geometry guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-guided image-based rendering pipeline

Progressive depth refinement using color masks

TSDF-accumulated view-consistent depth representation

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency