Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

πŸ“… 2025-09-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The semantic segmentation performance of 3D point clouds is hindered by misalignment between point cloud and large language model (LLM) representations: dense point-wise pre-alignment at the input stage dilutes object-level semantics, while the absence of geometric guidance at the output stage degrades fine-grained accuracy. To address this, we propose an object-centric segmentation framework that eliminates the need for large-scale pre-alignment. Our key contributions are: (1) object-centric discriminative representation learning, explicitly decoupling geometric structure from high-level semantics; (2) hard negative-aware training combined with LLM–point cloud cross-modal feature fusion; and (3) a geometry-reactivation decoder that explicitly incorporates geometric cues to guide mask generation. Our method achieves +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer, and attains state-of-the-art performance across seven 3D understanding benchmarks, demonstrating strong robustness and generalization in multi-task scenarios.

πŸ“ Abstract
3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, this misalignment limits both the input and output stages. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale 3D-text or 3D-image pre-alignment. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with the corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.
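The abstract describes the GRD as predicting masks by combining an LLM-refined object-centric token with the corresponding dense per-point features. A minimal NumPy sketch of this style of token-to-mask decoding; the function name, dot-product affinity, and thresholding here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def predict_mask(point_features, object_token, threshold=0.0):
    """Sketch of token-conditioned mask decoding (hypothetical).

    point_features: (N, D) dense per-point features kept through the pipeline.
    object_token:   (D,)   object-centric token carrying LLM-inferred semantics.
    Returns a boolean mask over the N points.
    """
    logits = point_features @ object_token  # (N,) per-point affinity to the object
    return logits > threshold

# Toy usage with random features standing in for a real point cloud.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))   # 8 points, 4-dim features
token = rng.normal(size=4)        # one object-centric query token
mask = predict_mask(feats, token)
```

Real mask decoders typically apply a learned projection before the affinity product; the dot product here only illustrates how a single token can select points from dense features.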
Problem

Research questions and friction points this paper is trying to address.

Addresses representation misalignment between LLM tokens and dense 3D point clouds
Removes the input-stage requirement for heavy pre-alignment of dense point patches
Restores output-stage fine-grained accuracy by reintroducing explicit geometric cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bridged 3D-Language Model without pre-alignment
Object-centric Discriminative Representation tokens
Geometric Reactivation Decoder fusing OcDR tokens with dense features for mask prediction
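The hard negative-aware training objective mentioned above can be pictured as a contrastive loss that up-weights the most confusable distractor. The sketch below is an illustrative reconstruction under stated assumptions (cosine similarity, a single `hard_weight` knob, InfoNCE-style form), not the authors' exact objective:

```python
import numpy as np

def hard_negative_infonce(anchor, positive, negatives,
                          temperature=0.1, hard_weight=2.0):
    """InfoNCE-style loss where the hardest (most similar) negative
    receives extra weight, approximating a hard negative-aware objective.
    All inputs are 1-D feature vectors; `hard_weight` is an assumed knob.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = cos(anchor, positive) / temperature
    negs = np.array([cos(anchor, n) for n in negatives]) / temperature

    weights = np.ones_like(negs)
    weights[np.argmax(negs)] = hard_weight  # emphasize the hardest distractor

    # -log( exp(pos) / (exp(pos) + sum_i w_i * exp(neg_i)) )
    denom = np.exp(pos) + np.sum(weights * np.exp(negs))
    return -pos + np.log(denom)

# Toy usage: a near-duplicate positive and random distractor negatives.
rng = np.random.default_rng(0)
anchor = rng.normal(size=16)
positive = anchor + 0.1 * rng.normal(size=16)
negatives = [rng.normal(size=16) for _ in range(4)]
loss = hard_negative_infonce(anchor, positive, negatives)
```

Raising `hard_weight` above 1 strictly increases the loss contribution of the most anchor-like distractor, which is the intuition behind training the OcDR tokens to separate the target from similar objects in the scene.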
πŸ”Ž Similar Papers
No similar papers found.