🤖 AI Summary
This paper tackles zero-shot 3D visual question answering (3D VQA) from multi-view images with pose-only supervision, i.e., without 3D ground truth or dedicated 3D encoders. We propose the first general-purpose framework that uses generalizable 3D Gaussian Splatting (3DGS) as a geometry-semantics bridge: it reconstructs an implicit 3D scene from the multi-view images and tokenizes it into structured 3D semantic representations that are fed directly into pretrained vision-language models (VLMs/LLMs) for cross-modal alignment and reasoning. The method enables open-set, cross-scene zero-shot 3D VQA without explicit 3D annotations or specialized 3D architectures. Across multiple benchmarks, it outperforms both task-specific 3D models and purely 2D multimodal baselines, and performs on par with state-of-the-art 3D large multimodal models (3D LMMs) that rely on additional 3D inputs (e.g., depth maps or point clouds).
📝 Abstract
Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction, enabling models to comprehend and interact with 3D environments through natural language. While 2D vision-language models (VLMs) have achieved remarkable success on 2D VQA tasks, progress in the 3D domain has been significantly slower due to the complexity of 3D data and the high cost of manual annotation. In this work, we introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM, enabling effective zero-shot 3D visual question answering (3D VQA) for scenes given only posed images. In experiments on multiple benchmarks, our approach outperforms both 3D models trained specifically for the task and prior 2D-LMM-based models that use only images (our setting), while achieving competitive performance with state-of-the-art 3D LMMs that additionally utilize 3D inputs.
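At a high level, the pipeline the abstract describes is: posed multi-view images → generalizable 3DGS reconstruction → a fixed set of 3D tokens → a frozen pretrained LLM. The sketch below caricatures only the data flow and tensor shapes; every function, dimension, and operation here is a hypothetical stand-in (random projections instead of learned encoders), not SplatTalk's actual implementation:

```python
import numpy as np

def images_to_gaussian_tokens(images, poses, n_tokens=256, feat_dim=32, seed=0):
    """Shape-level sketch: multi-view RGB images + camera poses ->
    a fixed set of 3D 'tokens' one could hand to a pretrained LLM.
    All names and shapes are illustrative, not the paper's API."""
    rng = np.random.default_rng(seed)
    v, h, w, _ = images.shape
    # 1) Stand-in for a 2D image encoder: per-pixel features
    #    via a fixed random projection of RGB values.
    proj = rng.standard_normal((3, feat_dim))
    feats = images.reshape(v, h * w, 3) @ proj            # (views, pixels, feat_dim)
    # 2) Stand-in for pose-aware lifting to 3D Gaussians:
    #    tag each view's features with a scalar pose code.
    pose_code = poses.reshape(v, -1).mean(axis=1, keepdims=True)  # (views, 1)
    feats = feats + pose_code[:, :, None]
    # 3) Tokenize: subsample per-"Gaussian" features into a fixed token set
    #    (the real method uses learned 3DGS reconstruction + tokenization).
    all_feats = feats.reshape(-1, feat_dim)
    idx = rng.choice(all_feats.shape[0], size=n_tokens, replace=False)
    return all_feats[idx]                                  # (n_tokens, feat_dim)

# Usage: 4 posed 8x8 RGB views -> 256 tokens ready for a (frozen) LLM.
views = np.random.rand(4, 8, 8, 3).astype(np.float32)
poses = np.tile(np.eye(4, dtype=np.float32), (4, 1, 1))
tokens = images_to_gaussian_tokens(views, poses)
print(tokens.shape)  # (256, 32)
```

The key property being illustrated is that the scene is compressed into a pose-grounded, fixed-size token sequence, so a pretrained LLM can consume it without any 3D-specific architecture changes.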