Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
3D Gaussian Splatting (3DGS) scenes lack generalizable, transferable feature representations. Method: We propose Chorus, the first end-to-end, feed-forward pretraining framework for a holistic 3DGS scene encoder. It jointly distills multi-source 2D foundation models (language-aligned, general-purpose vision, and object-aware) into a unified semantic-geometric representation. Chorus introduces a shared embedding space that fuses high-level semantics with fine-grained geometry, a lightweight, point-cloud-compatible variant relying solely on Gaussian centers, colors, and normals, and a render-and-distill strategy for cross-domain adaptation. Contributions/Results: Chorus significantly outperforms baselines on open-vocabulary segmentation and linear probing. Its lightweight variant surpasses point-cloud-based methods while using 39.9× fewer training scenes, and it enables efficient few-shot learning and cross-domain transfer.

📝 Abstract
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals ranging from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Beyond 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant that uses only the Gaussians' centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
Problem

Research questions and friction points this paper is trying to address.

Develop holistic 3D Gaussian Splatting encoder via multi-teacher distillation
Enable rich feature extraction from 3DGS primitives for diverse tasks
Transfer learned representations to point cloud benchmarks efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-teacher pretraining for holistic 3D Gaussian scene encoding
Distills complementary signals from 2D foundation models
Shared encoder with teacher-specific projectors forming a unified embedding space
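The shared-encoder-plus-projectors design above can be sketched as follows. This is a minimal illustrative PyTorch mock-up, not the paper's released implementation: the module names, feature dimensions, input featurization (center, color, normal per Gaussian), and the cosine-similarity distillation loss are all assumptions for exposition.

```python
# Hedged sketch: one shared encoder over Gaussian primitives, one lightweight
# projector per 2D teacher, trained to match each teacher's features.
# All dimensions and the loss choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, teacher_dims: dict):
        super().__init__()
        # Shared per-primitive encoder (a stand-in for the 3DGS backbone).
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # One projector per teacher (e.g. language-aligned, generalist,
        # object-aware), mapping the shared embedding into that teacher's
        # feature space.
        self.projectors = nn.ModuleDict({
            name: nn.Linear(hidden_dim, dim)
            for name, dim in teacher_dims.items()
        })

    def forward(self, gaussians: torch.Tensor) -> dict:
        shared = self.encoder(gaussians)  # (N, hidden_dim) shared embedding
        return {name: proj(shared) for name, proj in self.projectors.items()}

def distill_loss(student_feats: dict, teacher_feats: dict) -> torch.Tensor:
    # Cosine-similarity distillation per teacher, averaged over teachers.
    losses = [
        (1 - F.cosine_similarity(student_feats[k], teacher_feats[k], dim=-1)).mean()
        for k in teacher_feats
    ]
    return torch.stack(losses).mean()

# Toy usage: 1024 Gaussians, each featurized as center (3) + color (3) + normal (3).
teacher_dims = {"language": 512, "generalist": 384, "object": 256}
model = MultiTeacherDistiller(in_dim=9, hidden_dim=128, teacher_dims=teacher_dims)
x = torch.randn(1024, 9)
preds = model(x)
targets = {k: torch.randn(1024, d) for k, d in teacher_dims.items()}
loss = distill_loss(preds, targets)
```

The key design point the paper's bullets describe is that only the encoder is shared: each teacher's idiosyncrasies are absorbed by its own projector, so the shared embedding is pushed to carry signals useful to all teachers at once.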