🤖 AI Summary
This work addresses the severe degradation in rendering quality of 3D Gaussian Splatting (3DGS) when synthesizing out-of-distribution (OOD) novel views, i.e., viewpoints far from the training poses. We propose a point Transformer that operates directly on the Gaussian point set: it takes an initially optimized 3DGS set as input and refines it end-to-end in a single differentiable forward pass, removing artifacts that would otherwise appear in OOD test views. To our knowledge, this is the first integration of a point Transformer architecture into 3DGS representation learning, explicitly modeling geometric and appearance dependencies among Gaussians to improve OOD generalization. Experiments show that our method substantially suppresses artifacts under extreme novel viewpoints and consistently outperforms leading 3DGS regularization techniques, sparse-view multi-scene models, and diffusion-augmented frameworks, achieving state-of-the-art performance on OOD novel view synthesis.
📝 Abstract
3D Gaussian Splatting (3DGS) has recently transformed photorealistic reconstruction, achieving high visual fidelity and real-time performance. However, rendering quality significantly deteriorates when test views deviate from the camera angles used during training, posing a major challenge for applications in immersive free-viewpoint rendering and navigation. In this work, we conduct a comprehensive evaluation of 3DGS and related novel view synthesis methods under out-of-distribution (OOD) test camera scenarios. By creating diverse test cases with synthetic and real-world datasets, we demonstrate that most existing methods, including those incorporating various regularization techniques and data-driven priors, struggle to generalize effectively to OOD views. To address this limitation, we introduce SplatFormer, the first point transformer model specifically designed to operate on Gaussian splats. SplatFormer takes as input an initial 3DGS set optimized under limited training views and refines it in a single forward pass, effectively removing potential artifacts in OOD test views. To our knowledge, this is the first successful application of point transformers directly on 3DGS sets, surpassing the limitations of previous multi-scene training methods, which could handle only a restricted number of input views during inference. Our model significantly improves rendering quality under extreme novel views, achieving state-of-the-art performance in these challenging scenarios and outperforming various 3DGS regularization techniques, multi-scene models tailored for sparse view synthesis, and diffusion-based frameworks.
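The core idea, refining a set of Gaussian parameters via attention over the point set in a single forward pass, can be sketched as follows. This is only a toy illustration, not the actual SplatFormer architecture: the real model uses a learned point-transformer backbone with local attention over structured Gaussian features, whereas here `refine_gaussians`, the projection matrices `Wq`/`Wk`/`Wv`, and the tiny dimensions are all hypothetical and chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def refine_gaussians(gaussians, Wq, Wk, Wv):
    """One self-attention pass over a set of Gaussian parameters.

    gaussians: (N, D) array, each row a flattened Gaussian
    (e.g. position, scale, rotation, opacity, color coefficients).
    Returns a residually corrected (N, D) set, mirroring the idea of
    predicting a refinement of the initial 3DGS parameters in a
    single forward pass.
    """
    q, k, v = gaussians @ Wq, gaussians @ Wk, gaussians @ Wv
    # (N, N) attention captures dependencies among Gaussians.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    residual = attn @ v          # aggregated context per Gaussian
    return gaussians + residual  # residual refinement of the set

# Toy usage: 5 Gaussians with 8 parameters each.
rng = np.random.default_rng(0)
N, D = 5, 8
g0 = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
g1 = refine_gaussians(g0, Wq, Wk, Wv)
print(g1.shape)  # (5, 8)
```

The output set has the same shape as the input, so it can be rendered by the same splatting pipeline; only the parameter values are corrected.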