🤖 AI Summary
Existing 3D reconstruction methods exhibit limited extrapolation capability in under-observed regions, while generative prior-based approaches often suffer from poor scalability and content inconsistency. This work proposes a novel two-stage paradigm: first, a bidirectional generative diffusion model is trained with a new opacity blending strategy to balance observation consistency and plausible extrapolation; second, this diffusion model is distilled into a causal autoregressive model that enables single-pass forward inference to generate hundreds of views and provides pseudo-supervision signals to refine the 3D representation. To our knowledge, this is the first method to distill a bidirectional diffusion model into an efficient autoregressive architecture for 3D extrapolation, achieving a 1–3 dB PSNR improvement on standard benchmarks and significantly outperforming existing baselines in reconstructing entirely unobserved regions.
📝 Abstract
Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability: existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself: generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To address these shortcomings, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1–3 dB PSNR.
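The abstract does not spell out the opacity mixing strategy, but one plausible reading is a per-pixel blend: where the 3D Gaussian Splatting render has high accumulated opacity (well-observed regions), the rendered pixels condition the generator toward consistency; where opacity is low (under-observed regions), the generator's own content dominates so it can extrapolate freely. The sketch below illustrates that idea only; the function and variable names are hypothetical, not the paper's API.

```python
import numpy as np

def opacity_mixed_condition(rendered_rgb, rendered_alpha, generated_rgb):
    """Illustrative sketch (hypothetical names, not the paper's code).

    rendered_rgb:   (H, W, 3) render from the current 3D representation
    rendered_alpha: (H, W, 1) accumulated opacity of that render in [0, 1]
    generated_rgb:  (H, W, 3) the generative model's current prediction
    """
    # Use accumulated opacity as a per-pixel confidence weight:
    # observed pixels keep the render, unobserved pixels keep the generation.
    w = np.clip(rendered_alpha, 0.0, 1.0)
    return w * rendered_rgb + (1.0 - w) * generated_rgb

# Toy usage: a 2x2 image whose top row is well observed.
rendered = np.full((2, 2, 3), 0.8)
alpha = np.array([[[1.0], [1.0]],
                  [[0.0], [0.0]]])       # top row observed, bottom row not
generated = np.zeros((2, 2, 3))
mixed = opacity_mixed_condition(rendered, alpha, generated)
# top row keeps the rendered 0.8, bottom row falls back to the generator's 0.0
```

Under this reading, the blend weight is exactly the quantity a splatting renderer already produces for free (accumulated alpha), which would explain how the strategy balances observation consistency against extrapolation without extra supervision.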