AI Summary
This work addresses the limitations of existing Vision Transformer (ViT)-based feed-forward methods for novel view synthesis, which suffer from low input resolution and a lack of 3D consistency in their generation modules, leading to loss of high-frequency detail and structural inconsistencies across views. To overcome these issues, we propose a novel framework that integrates a dual-domain detail-aware module with a feature-guided one-step diffusion network. Our approach preserves the ViT's geometric priors while leveraging 3D Gaussian Splatting to achieve high-resolution, high-fidelity, and multi-view-consistent rendering. Crucially, we unify high-resolution detail enhancement with 3D-aware geometric representation in a joint optimization framework, co-training the ViT backbone and the diffusion refinement module. Experiments demonstrate that our method significantly outperforms existing feed-forward approaches across multiple benchmarks.
Abstract
We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations of recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, computational cost often constrains them to low-resolution inputs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, producing inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which handles high-resolution images without being limited by the ViT backbone and endows the Gaussians with additional features that store high-frequency details. We develop a feature-guided diffusion network that preserves high-frequency details during the restoration process. We further introduce a unified training strategy that jointly optimizes the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method maintains superior generation quality across multiple datasets.
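The joint optimization described above can be pictured as a two-stage pipeline trained under one combined objective: the backbone predicts Gaussians plus auxiliary detail features, the rendering is refined by the feature-guided network, and a weighted sum of refined and coarse losses drives both modules. The following is a minimal toy sketch of that training signal; every function, the scalar "images", and the loss weighting are illustrative stand-ins, not the paper's actual components.

```python
# Toy sketch of joint optimization over a backbone and a refinement stage.
# All names (backbone_forward, render, refine, joint_loss) and the 0.5
# weighting are hypothetical placeholders for the modules in the abstract.

def backbone_forward(images):
    """Stand-in for the ViT backbone: coarse Gaussians + detail features."""
    gaussians = [p * 0.5 for p in images]   # toy "geometry" prediction
    features = [p * 0.1 for p in images]    # toy high-frequency features
    return gaussians, features

def render(gaussians):
    """Stand-in for 3DGS rasterization of the predicted Gaussians."""
    return [g * 2.0 for g in gaussians]

def refine(rendered, features):
    """Stand-in for the feature-guided diffusion refiner."""
    return [r + f for r, f in zip(rendered, features)]

def joint_loss(refined, target, coarse, lam=0.5):
    """Combined objective: refined-output MSE + weighted coarse-render MSE,
    so gradients reach both the backbone and the refiner."""
    l_refined = sum((p - t) ** 2 for p, t in zip(refined, target)) / len(target)
    l_coarse = sum((c - t) ** 2 for c, t in zip(coarse, target)) / len(target)
    return l_refined + lam * l_coarse

images = [1.0, 2.0, 3.0]
gaussians, feats = backbone_forward(images)
coarse = render(gaussians)
refined = refine(coarse, feats)
loss = joint_loss(refined, images, coarse)
```

In a real system the loss would backpropagate through both stages in one step, which is what distinguishes this joint scheme from training a 3D-agnostic enhancer separately.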