๐ค AI Summary
To address the challenges of severe geometric ambiguity and poor generalization in single-image 3D scene reconstruction, this paper proposes the first generalizable, text-guided Transformer framework for monocular 3D Gaussian Splatting modeling and novel-view synthesis. Methodologically: (1) We design a text-embedding-driven cross-attention mechanism to enhance semantic understanding from a single input image; (2) we introduce 3D point features as explicit spatial priors to constrain geometric structureโmarking the first such integration in this context; (3) we formulate an end-to-end generalizable architecture that reconstructs unseen scenes without fine-tuning. Evaluated on large-scale single-view datasets, our method achieves state-of-the-art performance: +2.1 dB PSNR and โ18% LPIPS improvement over prior works. Crucially, it preserves structural consistency and texture fidelity across diverse, unseen scenes, demonstrating strong zero-shot generalization capability.
๐ Abstract
Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.