Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

πŸ“… 2025-11-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of unified, semantically rich holistic representations for 3D scenes, which hinders the integration of language-level reasoning and spatial understanding. To this end, we propose NDTokenizer3Dβ€”a novel framework that introduces multi-scale Normal Distribution Transform (NDT) representation coupled with a multi-scale decoder (MSDec) to map raw point clouds end-to-end into structured scene tokens. A three-stage tokenization pipeline enables decoder reuse for interactive prompting, mask generation, and cross-modal alignment, unifying referential segmentation, 3D visual question answering, and dense captioning. By incorporating point cloud feature fusion and a large language model (LLM) interface, the method significantly enhances fine-grained semantic comprehension. Extensive experiments on multiple 3D vision-language benchmarks demonstrate state-of-the-art performance, validating the framework’s generality and effectiveness.

Technology Category

Application Category

πŸ“ Abstract
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
Problem

Research questions and friction points this paper is trying to address.

Tokenizing 3D scenes into holistic representations for vision-language models
Leveraging scene tokens across diverse 3D understanding tasks effectively
Bridging language reasoning with 3D spatial understanding through unified architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale NDT representation preserves geometric details
Cross-scale feature fusion produces holistic scene tokens
Unified architecture supports diverse 3D understanding tasks
πŸ”Ž Similar Papers
No similar papers found.