TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing Transformer-based feed-forward 3D Gaussian Splatting methods regress Gaussian means as depths along camera rays. This ties the number of predicted primitives to the input resolution and the number of views, and leaves the methods sensitive to pose noise and multi-view inconsistencies. TokenGS decouples Gaussian prediction from pixel space: a Transformer encoder-decoder with learnable Gaussian tokens directly regresses 3D mean coordinates, trained with only a self-supervised rendering loss. The method supports a flexible number of primitives, enables efficient test-time optimization in token space, and yields emergent properties such as static-dynamic scene decomposition and scene flow. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regular geometry, more balanced Gaussian distributions, and improved robustness.

📝 Abstract
In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
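
The contrast drawn in the abstract between the two parameterizations can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the per-token decoder head, the choice of one Gaussian per token, and the token budget of 4096 are all hypothetical.

```python
# Illustrative sketch only; names and shapes are assumptions, not the paper's code.

def mean_from_ray_depth(origin, direction, depth):
    """Pixel-aligned parameterization: the mean is constrained to lie on the
    camera ray through one pixel, so the network only predicts a scalar depth."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def pixel_aligned_count(num_views, height, width):
    """One Gaussian per input pixel: the count grows with views and resolution."""
    return num_views * height * width


def mean_from_token(head_output):
    """Token parameterization: the first three outputs of a (hypothetical)
    per-token decoder head are read directly as the xyz mean."""
    x, y, z = head_output[:3]
    return (x, y, z)


def token_count(num_tokens, gaussians_per_token=1):
    """The count is fixed by the number of learnable tokens, not by the inputs.
    gaussians_per_token=1 is an assumption made for this sketch."""
    return num_tokens * gaussians_per_token


if __name__ == "__main__":
    # A ray from the origin along +z at depth 2 places the mean on that ray.
    print(mean_from_ray_depth((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.0))
    # The pixel-aligned count changes with the inputs; the token count does not.
    for views, res in [(2, 256), (4, 512)]:
        print(views, res, pixel_aligned_count(views, res, res), token_count(4096))
```

In the ray-depth case the camera origin and ray direction enter the mean directly, so an error in the pose moves the Gaussian with it. In the token case the mean is a free 3D point, which is one way to read the abstract's robustness claim.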

Problem

Research questions and friction points this paper is trying to address.

3D Gaussian Splatting
Transformer-based prediction
depth regression
primitive decoupling
multiview reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
learnable tokens
encoder-decoder architecture
self-supervised rendering
test-time optimization