🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in 3D point cloud–language understanding and generation, lacking fine-grained semantic modeling and controllable generation mechanisms. To address this, we propose GPT4Point, the first MLLM unifying point cloud and natural language modeling. Our method introduces: (1) a point cloud–language joint representation framework integrating a Point-BERT variant encoder, cross-modal attention, and a progressive point cloud decoder; (2) the Pyramid-XL annotation engine, enabling the release of a million-scale, fine-grained 3D object-text pair dataset; and (3) geometry-faithful, color-consistent controllable point cloud enhancement from low- to high-quality inputs. Evaluated on a newly established 3D point cloud language understanding benchmark, our model achieves state-of-the-art performance across point cloud captioning, visual question answering, and conditional generation, outperforming prior work by 12.6% in PSNR and 9.4% in F-Score.
📝 Abstract
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world remains limited, which holds back progress in 3D language understanding and generation. To address this, we introduce GPT4Point, a point-language multimodal model designed for unified 3D object understanding and generation within the MLLM framework. As a 3D MLLM, GPT4Point can execute point-text reference tasks such as point-cloud captioning and Q&A. It also supports controllable 3D generation: starting from a low-quality point-text feature, it produces high-quality results that preserve geometric shape and color. To meet the need for large numbers of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It builds a database of over 1M objects from the Objaverse-XL dataset, annotated at varied levels of text granularity, which is essential for training GPT4Point. We also propose a comprehensive benchmark for evaluating 3D point-language understanding. In extensive evaluations, GPT4Point demonstrates superior performance in both understanding and generation.