AI Summary
This work addresses the limited explicit reasoning capability of existing 3D multimodal models, which struggle to align language with unstructured point cloud data. To bridge this gap, the authors introduce Chain-of-Thought (CoT) reasoning into 3D point cloud language understanding through a data-centric two-stage framework. The first stage refines point-text instruction data using vision-language model evaluation and reference-guided refinement; the second synthesizes reasoning paths using human-in-the-loop prompt optimization (HiLPO). The result is PoCoTI, described as the first large-scale 3D CoT dataset featuring explicit reasoning paths. The authors fine-tune PointLLM on this dataset to obtain PointLLM-R, a 3D multimodal language model with reasoning capabilities. Experiments show that PointLLM-R achieves state-of-the-art performance on generative 3D classification and captioning tasks and generalizes well to real-world scanned point clouds and multi-turn dialogue scenarios.
Abstract
Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.
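The data flow of the two-stage pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` type, the three callables (`score_quality`, `refine`, `synthesize_reasoning`), and the `min_score` threshold are all hypothetical placeholders, and the choice to refine only samples that score below a threshold is an assumption. The HiLPO step is assumed to have already produced the prompt used inside `synthesize_reasoning`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One point-text instruction pair; the point cloud is referenced by id only."""
    object_id: str
    instruction: str
    response: str
    reasoning: Optional[str] = None  # filled in by stage 2


def build_cot_dataset(
    samples: List[Sample],
    score_quality: Callable[[Sample], float],        # hypothetical VLM-based evaluator
    refine: Callable[[Sample], Sample],              # hypothetical reference-guided refiner
    synthesize_reasoning: Callable[[Sample], str],   # hypothetical prompted CoT generator
    min_score: float = 0.5,                          # assumed threshold, not from the paper
) -> List[Sample]:
    """Return new samples with refined text and an attached reasoning path."""
    result = []
    for sample in samples:
        # Stage 1: evaluate quality and refine samples that score below the threshold.
        if score_quality(sample) < min_score:
            sample = refine(sample)
        # Stage 2: attach an explicit reasoning path to the (possibly refined) sample.
        result.append(
            Sample(
                object_id=sample.object_id,
                instruction=sample.instruction,
                response=sample.response,
                reasoning=synthesize_reasoning(sample),
            )
        )
    return result
```

Passing the evaluator, refiner, and generator as callables keeps the sketch independent of any particular model or API; in practice each would wrap a model call, and the input samples are left unmodified.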