AI Summary
This paper addresses the problem of efficiently reconstructing high-fidelity 3D Gaussian splatting representations from unstructured image collections, supporting inputs with or without camera poses and with calibrated or uncalibrated intrinsics. The proposed method introduces a hybrid training strategy that decouples learning of 3D Gaussian parameters from camera parameters, progressively transitioning from ground-truth to predicted poses. To resolve scale ambiguity, it incorporates pairwise camera-distance normalization and intrinsic parameter embedding. An end-to-end network jointly predicts per-view local Gaussian distributions, extrinsic poses, and intrinsic parameters, followed by autoregressive refinement of the global representation. Evaluated on an NVIDIA GH200, the method reconstructs 100 images (280×518) in just 2.69 seconds and achieves state-of-the-art performance across multiple benchmarks, demonstrating both exceptional efficiency and strong generalization capability.
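The progressive transition from ground-truth to predicted poses can be pictured as a scheduled-sampling-style curriculum. The sketch below is a minimal illustration under assumed details (the linear ramp, the `choose_pose` helper, and the per-step sampling are hypothetical; the paper's actual schedule may differ):

```python
import random

def choose_pose(gt_pose, pred_pose, step, total_steps):
    """Hypothetical mixing schedule: early in training, aggregate local
    Gaussians with ground-truth poses; as training progresses, substitute
    the network's predicted poses with increasing probability.

    Assumes a linear ramp of the predicted-pose probability from 0 to 1;
    the paper does not specify this exact form.
    """
    p_pred = min(1.0, step / total_steps)  # assumed linear ramp
    return pred_pose if random.random() < p_pred else gt_pose

# At step 0 the model always sees ground-truth poses (no instability);
# by the end it always sees its own predictions (no exposure bias).
```

The point of such a mix is that the pose branch and the Gaussian branch are entangled: training purely on predicted poses is unstable early on, while training purely on ground-truth poses leaves the model unprepared for its own pose errors at inference time.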
Abstract
Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem with a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, enabling reconstruction from uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280x518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.
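The pairwise camera-distance normalization mentioned above can be sketched as follows. This is an assumed formulation (the function name, the use of the mean over all camera-center pairs, and normalizing it to 1 are illustrative choices; the paper's exact normalization may differ):

```python
import numpy as np

def normalize_scene_scale(camera_centers):
    """Hypothetical pairwise camera-distance normalization: rescale the
    scene so that the mean distance between all pairs of camera centers
    equals 1, removing the global scale ambiguity of the reconstruction.

    camera_centers: (N, 3) array of camera positions, N >= 2.
    Returns the rescaled camera centers; in a full pipeline the same
    scale factor would also be applied to the 3D Gaussian positions.
    """
    c = np.asarray(camera_centers, dtype=np.float64)
    n = len(c)
    # All pairwise distances (N, N); the diagonal is zero.
    dists = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    mean_pairwise = dists.sum() / (n * (n - 1))  # average over i != j
    return c / mean_pairwise
```

Fixing a canonical scale this way gives the network a consistent target: without it, the same scene could be reconstructed at any global scale, which makes the regression of both poses and Gaussian positions ill-posed.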