VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

📅 2025-09-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing feed-forward 3D Gaussian Splatting (3DGS) methods rely on pixel-aligned Gaussian prediction, making them vulnerable to variations in input view count, occlusions, and low-texture regions, leading to biased density distributions and multi-view inconsistency. To address this, we propose VolSplat, the first method to introduce voxel-aligned Gaussian prediction: it replaces error-prone 2D feature matching with direct, geometry-adaptive Gaussian generation from a learnable 3D voxel grid. This paradigm enables joint modeling of multi-view geometry and rendering within a single feed-forward network, supporting scene-complexity-aware density control. Evaluated on RealEstate10K and ScanNet, VolSplat achieves state-of-the-art performance, delivering substantial improvements in novel-view synthesis quality, multi-view consistency, and 3D reconstruction fidelity.

๐Ÿ“ Abstract
Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
Problem

Research questions and friction points this paper is trying to address.

Overcomes pixel alignment limitations in 3D Gaussian reconstruction
Addresses view-biased density distributions and alignment errors
Ensures robust multi-view consistency for novel view synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voxel-aligned Gaussians replace pixel-aligned prediction
Direct Gaussian prediction from a 3D voxel grid
Adaptive Gaussian density control based on 3D scene complexity
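The core idea behind the voxel-aligned bullets above can be illustrated with a minimal sketch: per-pixel back-projected points from multiple views are quantized into a voxel grid, so overlapping observations of the same surface collapse into one Gaussian per occupied voxel rather than one per pixel per view. This is an illustrative toy in NumPy under assumed names (`voxelize_points`, `voxel_size`), not the paper's actual network or implementation:

```python
import numpy as np

def voxelize_points(points, voxel_size=0.05):
    """Map per-pixel 3D points from multiple views to unique voxel centers.

    Illustrates why voxel alignment decouples Gaussian count from the
    number of input views: duplicated observations of the same surface
    fall into the same voxel and yield a single anchor point.
    """
    # Quantize each point to an integer voxel index.
    indices = np.floor(points / voxel_size).astype(np.int64)
    # Keep one entry per occupied voxel (unique rows).
    unique_idx = np.unique(indices, axis=0)
    # Return the center of each occupied voxel.
    return (unique_idx + 0.5) * voxel_size

# Two views observing the same two surface patches produce four
# per-pixel points; voxelization merges them into two voxel centers
# instead of predicting four redundant, view-biased Gaussians.
view1 = np.array([[0.01, 0.01, 0.01], [0.26, 0.01, 0.01]])
view2 = np.array([[0.02, 0.02, 0.02], [0.26, 0.02, 0.02]])
centers = voxelize_points(np.vstack([view1, view2]), voxel_size=0.05)
```

In a pixel-aligned scheme the Gaussian count grows linearly with the number of input views; here it is bounded by the number of occupied voxels, which tracks scene complexity instead.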
Weijie Wang
PhD Student, Zhejiang University
Computer Vision · Efficient AI · Deep Learning
Yeqing Chen
University of Electronic Science and Technology of China
Zeyu Zhang
GigaAI
Hengyu Liu
The Chinese University of Hong Kong
Haoxiao Wang
Zhejiang University
Zhiyuan Feng
Tsinghua University
Wenkang Qin
Peking University
Zheng Zhu
GigaAI
Donny Y. Chen
Researcher at ByteDance Seed (Singapore)
Computer Vision · Computer Graphics · Affective Computing
Bohan Zhuang
Zhejiang University
Efficient AI · MLSys