AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing sparse-view 3D reconstruction methods struggle with geometric consistency and scalability, particularly on large or diverse scenes given an arbitrary number of unordered inputs. To address this, the work proposes AnyRecon, a framework that combines a persistent global 3D geometric memory with a geometry-aware conditioning mechanism to couple generation and reconstruction. The approach prepends a capture-view cache and removes temporal compression to preserve frame-level correspondence, and it builds on a video diffusion model combined with 4-step distillation, context-window sparse attention, and geometry-driven capture-view retrieval. The design targets robust, scalable 3D reconstruction under irregular inputs, large viewpoint gaps, and long camera trajectories.
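As a rough illustration of what geometry-driven capture-view retrieval could look like, the sketch below ranks capture views by how many points of a shared 3D memory they see in common with a target view. Everything here is an assumption for illustration: the cone-based visibility test (no occlusion handling), the overlap score, and the names `visible_ids` and `retrieve_views` are hypothetical and are not taken from the paper.

```python
import math

def visible_ids(points, cam_pos, cam_dir, half_fov_deg=45.0):
    """Indices of points inside a camera's viewing cone (no occlusion test)."""
    cos_limit = math.cos(math.radians(half_fov_deg))
    dir_norm = math.sqrt(sum(c * c for c in cam_dir))
    ids = set()
    for i, p in enumerate(points):
        ray = [p[j] - cam_pos[j] for j in range(3)]
        dist = math.sqrt(sum(c * c for c in ray))
        if dist == 0.0:
            continue  # point coincides with the camera centre
        cos_angle = sum(ray[j] * cam_dir[j] for j in range(3)) / (dist * dir_norm)
        if cos_angle >= cos_limit:
            ids.add(i)
    return ids

def retrieve_views(points, captures, target, k=2):
    """Return indices of the k capture views sharing the most visible points with the target."""
    target_ids = visible_ids(points, *target)
    scored = []
    for idx, cam in enumerate(captures):
        overlap = len(visible_ids(points, *cam) & target_ids)
        scored.append((overlap, idx))
    # Highest overlap first; ties broken by the lower capture index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]

# Toy scene: three points in front of the origin, one behind it.
points = [(0, 0, 5), (1, 0, 5), (-1, 0, 5), (0, 0, -5)]
# Each camera is (position, viewing direction).
captures = [((0, 0, 0), (0, 0, 1)), ((0, 0, 0), (0, 0, -1)), ((0.5, 0, 0), (0, 0, 1))]
target = ((0.2, 0, 0), (0, 0, 1))
selected = retrieve_views(points, captures, target, k=2)
```

In this toy scene the two forward-facing captures are selected and the backward-facing one is skipped, since it shares no visible points with the target view.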

📝 Abstract

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture-view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
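To make the efficiency argument concrete, here is a minimal frame-level sketch of one way a context-window sparse attention mask could be combined with a prepended capture-view cache. The abstract does not specify the exact attention pattern, so the rules below are assumptions: every frame attends to all cached capture views, generated frames additionally attend to neighbours within a fixed window, and the name `context_window_mask` is hypothetical.

```python
def context_window_mask(num_cache, num_gen, window):
    """Boolean mask over frames: mask[q][k] is True if frame q may attend to frame k.

    Frames 0..num_cache-1 are the prepended capture views (assumed global memory);
    the remaining num_gen frames are generated views.
    """
    total = num_cache + num_gen
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_cache:
                # Assumption: every frame can read the capture-view cache.
                mask[q][k] = True
            elif q >= num_cache and abs(q - k) <= window:
                # Assumption: generated frames see only a local context window.
                mask[q][k] = True
    return mask

def attended_pairs(mask):
    """Count the (query, key) frame pairs that the mask allows."""
    return sum(sum(row) for row in mask)

mask = context_window_mask(num_cache=2, num_gen=6, window=1)
sparse_pairs = attended_pairs(mask)   # pairs kept by the sparse pattern
dense_pairs = len(mask) ** 2          # pairs under full attention
```

With 2 cached views, 6 generated frames, and a window of 1, the mask keeps 32 of the 64 frame pairs. For a fixed cache size and window, the number of kept pairs grows linearly with the number of generated frames rather than quadratically.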

Problem

Research questions and friction points this paper is trying to address.

Sparse-view 3D reconstruction

arbitrary-view reconstruction

geometric consistency

large-scale 3D scenes

video diffusion model

Innovation

Methods, ideas, or system contributions that make the work stand out.

arbitrary-view reconstruction

video diffusion model

geometry-aware conditioning