Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian

2024-08-07
Citations: 0
Influential: 0
AI Summary
To address the challenges of open-vocabulary 3D scene querying and poor generalization in semantic segmentation for autonomous driving, this paper proposes a novel framework integrating language-embedded 3D Gaussian representations with a lightweight LLM. Methodologically: (1) it pioneers injecting learnable language embeddings into 3D Gaussian splatting to enable cross-modal alignment; (2) it introduces an "auxiliary positive token" mechanism to enhance fine-grained semantic responsiveness; (3) it establishes a micro-fine-tuning pipeline for edge-deployable small LLMs (e.g., Phi-3), balancing accuracy and real-time inference. On the WayveScenes101 benchmark, our method significantly outperforms predefined-phrase baselines; the fine-tuned compact model achieves segmentation accuracy comparable to GPT-3.5 Turbo while accelerating inference by 3.2×. Ablation studies confirm the efficacy of each component.

Abstract
This paper introduces a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs). We propose utilizing LLMs to generate both contextually canonical phrases and helping positive words for enhanced segmentation and scene interpretation. Our method leverages GPT-3.5 Turbo as an expert model to create a high-quality text dataset, which we then use to fine-tune smaller, more efficient LLMs for on-device deployment. Our comprehensive evaluation on the WayveScenes101 dataset demonstrates that LLM-guided segmentation significantly outperforms traditional approaches based on predefined canonical phrases. Notably, our fine-tuned smaller models achieve performance comparable to larger expert models while maintaining faster inference times. Through ablation studies, we discover that the effectiveness of helping positive words correlates with model scale, with larger models better equipped to leverage additional semantic information. This work represents a significant advancement towards more efficient, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic querying while maintaining practical deployment considerations.
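The abstract does not spell out how canonical phrases and helping positive words enter the segmentation score. As a rough illustration only, the sketch below assumes a pairwise-softmax relevancy score of the kind commonly used with language-embedded radiance fields: the language feature rendered at a pixel is compared against each positive embedding (the query plus any helping positive words) and against each canonical phrase embedding. The function names, the temperature value, and the choice to take the maximum over positives are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevancy(feature, positive_embs, canonical_embs, temperature=0.1):
    """Hypothetical relevancy of one rendered language feature.

    For each positive embedding, compute the two-way softmax probability
    against every canonical embedding and keep the minimum (worst case).
    The final score is the maximum of those values over all positives
    (an assumption about how helping positive words could be combined).
    """
    best = 0.0
    for pos in positive_embs:
        s_pos = cosine(feature, pos) / temperature
        worst = 1.0
        for canon in canonical_embs:
            s_canon = cosine(feature, canon) / temperature
            # Equivalent to exp(s_pos) / (exp(s_pos) + exp(s_canon)).
            p = 1.0 / (1.0 + math.exp(s_canon - s_pos))
            worst = min(worst, p)
        best = max(best, worst)
    return best


# Toy 2-D embeddings standing in for real text and scene features.
query = [1.0, 0.0]
canonical = [[0.0, 1.0]]
print(relevancy([1.0, 0.0], [query], canonical) > 0.5)  # feature matches the query
print(relevancy([0.0, 1.0], [query], canonical) > 0.5)  # feature matches the canonical phrase
```

Under this assumed scheme, a pixel would be assigned to the queried class when its score exceeds 0.5, and LLM-generated canonical phrases would replace a fixed list of generic negatives.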
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary 3D scene querying
Enhanced segmentation with LLMs
Efficient on-device LLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for scene segmentation
GPT-3.5 Turbo dataset creation
Fine-tuned efficient on-device LLMs
Amirhosein Chahe
Drexel University, Philadelphia PA 19104, USA
Lifeng Zhou
Assistant Professor, Drexel University
Robotics