TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes the first fully automatic track-then-label framework for open-world referring expression segmentation in 3D Gaussian Splatting. Existing methods rely on costly manual annotation and per-view pseudo-masks, which tend to be inconsistent across views and to generalize poorly across query granularities. The approach decouples object discovery from semantic grounding and combines three components: a Trajectory-Aware Semantic Consensus Module (TSCM), visibility-aware description generation, and a Hybrid Training Strategy (HTS) built on multi-positive contrastive learning. The authors report state-of-the-art performance across multiple benchmarks without using any human-labeled data.
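To make the multi-positive contrastive idea concrete, here is a minimal sketch of a generic loss for one anchor with several positives. It is an illustration under assumptions, not the paper's objective: the function name, the use of cosine similarity, the temperature value, and the reading of "positives" as several text descriptions of the same object (for example one coarse category name and one fine-grained referring phrase) are all hypothetical.

```python
import math


def multi_positive_contrastive_loss(query, candidates, positive_idx, temperature=0.07):
    """Generic multi-positive InfoNCE-style loss for a single anchor.

    query:        feature vector of the anchor (hypothetical object feature)
    candidates:   list of feature vectors (hypothetical text embeddings)
    positive_idx: indices in `candidates` that all describe the anchor
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Temperature-scaled similarity between the anchor and every candidate.
    logits = [cosine(query, c) / temperature for c in candidates]

    # Numerically stable log-sum-exp over all candidates (the denominator).
    peak = max(logits)
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))

    # Average the negative log-probability over all positives.
    return -sum(logits[i] - log_denom for i in positive_idx) / len(positive_idx)


# Toy data: the first two candidates are close to the anchor, the third is not.
anchor = [1.0, 0.0]
texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loss_aligned = multi_positive_contrastive_loss(anchor, texts, [0, 1])
loss_misaligned = multi_positive_contrastive_loss(anchor, texts, [2])
```

With several positives per anchor, the loss stays low only when every positive scores well against the negatives, which is one plausible way a coarse and a fine description could be trained together.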

📝 Abstract

Referring 3D Gaussian Splatting (R3DGS), which segments 3D objects from natural-language queries, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo-mask generation, and they suffer from multi-view inconsistency and poor generalization to queries of varying specificity. To address this, we present TrackRef3D, a fully automatic pipeline for open-world referring segmentation in 3D Gaussian Splatting (3DGS) that requires no manual annotation. It introduces a multi-view consistent track-then-label paradigm that decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM), which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity for each object, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity, and we propose a Hybrid Training Strategy (HTS) that uses a multi-positive contrastive objective to jointly optimize coarse category semantics and fine-grained referential cues, making the model robust to queries of varying specificity. Extensive experiments on multiple benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.
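The cross-view aggregation described for TSCM can be pictured with a small sketch. This is a simplified illustration under assumptions, not the authors' implementation: the synonym table stands in for synonymous clustering, the per-view weight stands in for whatever trajectory or visibility cue the method actually uses, and all names (`consensus_labels`, `synonyms`, `weight`) are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_labels(predictions, synonyms):
    """Pick one canonical label per tracked object by weighted voting.

    predictions: iterable of (track_id, view_id, label, weight), where
                 weight is a hypothetical per-view confidence
    synonyms:    dict mapping a raw label to its canonical form
                 (a stand-in for synonymous clustering)
    """
    votes = defaultdict(Counter)
    for track_id, _view_id, label, weight in predictions:
        raw = label.lower().strip()
        # Merge synonymous labels before counting so they pool their votes.
        canonical = synonyms.get(raw, raw)
        votes[track_id][canonical] += weight
    # The highest-weighted canonical label becomes the object's identity.
    return {track: counts.most_common(1)[0][0] for track, counts in votes.items()}


# Toy per-view predictions for two tracked objects.
per_view = [
    (0, 0, "sofa", 0.9),
    (0, 1, "couch", 0.8),
    (0, 2, "bench", 1.0),
    (1, 0, "mug", 0.7),
    (1, 1, "cup", 0.6),
]
synonym_table = {"couch": "sofa", "cup": "mug"}
labels = consensus_labels(per_view, synonym_table)
```

In this toy data, "bench" has the single highest per-view weight for track 0, but "sofa" and "couch" pool their votes once merged, so the track is labeled "sofa" in every view. That is the sense in which voting over a track, rather than labeling each view independently, yields a consistent identity.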

Problem

Research questions and friction points this paper is trying to address.

referring segmentation

3D Gaussian Splatting

multi-view consistency

open-world

semantic grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

track-then-label

multi-view consistency

3D Gaussian Splatting