🤖 AI Summary
Existing pixel-wise 3D Gaussian splatting methods for feedforward 3D scene reconstruction and understanding from pose-free sparse views suffer from high redundancy and inefficient multi-view feature aggregation.
Method: We propose C3G, a token-guided Gaussian splatting framework that learns compact, geometry-aware Gaussians only at salient spatial locations via learnable tokens, coupled with attention-driven Gaussian decoding and multi-view self-attention feature aggregation. It unifies reconstruction and 2D-to-3D feature lifting within a single architecture.
Contribution/Results: C3G achieves strong geometric awareness with a significantly reduced memory footprint, and substantially improves feature fidelity and generalization in pose-free novel view synthesis and 3D open-vocabulary segmentation. Notably, it attains state-of-the-art performance using only ~2K Gaussians, yielding substantial memory-efficiency gains over prior per-pixel approaches.
📝 Abstract
Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, which degrades both novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns during Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
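The core mechanism described above, learnable tokens that attend over multi-view features and whose attention weights are then reused to lift 2D features onto the Gaussians, can be sketched as follows. This is a minimal illustration under assumed shapes and names (a single attention layer in numpy), not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Assumed toy dimensions: ~2K tokens (one per Gaussian, as in the summary),
# 4 input views with 256 flattened pixel features each, 64-dim embeddings.
num_tokens, num_views, hw, dim = 2048, 4, 256, 64

tokens = rng.normal(size=(num_tokens, dim))        # learnable token queries
feats = rng.normal(size=(num_views * hw, dim))     # flattened multi-view image features

# Tokens aggregate multi-view features via scaled dot-product attention,
# so each Gaussian's embedding integrates evidence from all views.
attn = softmax(tokens @ feats.T / np.sqrt(dim))    # (num_tokens, num_views*hw)
gauss_emb = attn @ feats                           # compact per-Gaussian embeddings

# Reuse the same attention pattern to lift auxiliary 2D features
# (e.g. per-pixel semantic features) onto the 3D Gaussians.
feats_2d = rng.normal(size=(num_views * hw, 32))   # hypothetical 32-dim 2D features
gauss_feats = attn @ feats_2d                      # (num_tokens, 32) lifted features

print(gauss_emb.shape, gauss_feats.shape)          # (2048, 64) (2048, 32)
```

The key design point the abstract highlights is that lifting reuses the already-learned attention weights, so no separate 2D-to-3D aggregation stage is needed.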