GenFusion: Closing the Loop between Reconstruction and Generation via Videos

📅 2025-03-27
🤖 AI Summary
A persistent conditioning gap exists between 3D reconstruction and generation: reconstruction typically requires dense multi-view inputs, whereas generation often operates from a single view or none, a mismatch rooted in the misalignment between geometric constraints and generative priors. This work introduces a reconstruction-driven video diffusion framework that unifies sparse-view RGB-D reconstruction with video diffusion modeling. Key contributions include: (1) a cyclical fusion paradigm that progressively augments the training views, overcoming viewpoint saturation; and (2) conditioning the video diffusion model on artifact-prone RGB-D renderings within an iterative generation-reconstruction closed loop. Evaluated under extremely sparse viewpoints and severe occlusions, the method achieves significant improvements in novel-view synthesis quality, surpassing state-of-the-art methods in PSNR and SSIM, while demonstrating markedly better generalization and robustness.

📝 Abstract
Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse view and masked input, validates the effectiveness of our approach.
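The cyclical fusion pipeline described above can be sketched as a simple loop. This is a minimal illustration of the control flow only; the function names (`reconstruct`, `render_rgbd`, `diffuse_restore`) and the round/view counts are hypothetical placeholders, not the authors' API.

```python
def reconstruct(views):
    # Placeholder: fit a 3D representation to the current training views.
    return {"n_views": len(views)}

def render_rgbd(model, n_novel):
    # Placeholder: render artifact-prone RGB-D frames from novel viewpoints.
    return [f"rgbd_render_{i}" for i in range(n_novel)]

def diffuse_restore(renders):
    # Placeholder: a video diffusion model conditioned on the RGB-D
    # renderings produces restored frames.
    return [r.replace("render", "restored") for r in renders]

def cyclic_fusion(sparse_views, n_novel_per_round=4, rounds=3):
    """Iteratively expand the training set with diffusion-restored frames."""
    views = list(sparse_views)
    for _ in range(rounds):
        model = reconstruct(views)                       # 1. reconstruct from current views
        renders = render_rgbd(model, n_novel_per_round)  # 2. render novel RGB-D views
        restored = diffuse_restore(renders)              # 3. restore with video diffusion
        views.extend(restored)                           # 4. fuse restored frames back in
    return views

expanded = cyclic_fusion(["v0", "v1", "v2"])
print(len(expanded))  # 3 initial + 3 rounds * 4 restored = 15
```

Each round's restored frames become training data for the next round's reconstruction, which is what allows the view coverage to grow progressively instead of saturating.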
Problem

Research questions and friction points this paper is trying to address.

Bridging the conditioning gap between 3D reconstruction and generation
Aligning 3D constraints with generative priors for better synthesis
Overcoming viewpoint saturation in reconstruction and generation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstruction-driven video diffusion model
Cyclical fusion pipeline for training
Progressive expansion of the training set with restored frames