More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

📅 2024-08-28
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 1
Influential: 0
🤖 AI Summary
The scarcity of 3D-text paired data hinders large language models’ (LLMs) ability to comprehend the physical world. Method: This paper introduces “text-augmented point cloud–language understanding” and proposes a novel zero-parameter cross-modal alignment paradigm: leveraging a pretrained point cloud–text encoder, it integrates zero-parameter cross-attention token pooling with text-space mapping and amplification mechanisms, and employs a three-stage progressive alignment training strategy. We further construct a high-quality 3D semantic dataset comprising 6 million free-form textual descriptions. Contribution/Results: Experiments show that our method achieves state-of-the-art performance using only 12% labeled point cloud data. Remarkably, it retains strong 3D reasoning capability even under pure textual input, significantly advancing few-shot 3D vision–language tasks.

📝 Abstract
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data.
Problem

Research questions and friction points this paper is trying to address.

Enable LLMs to understand 3D objects with minimal point-text pairs
Compensate for scarce 3D data using abundant text data via alignment
Achieve robust 3D understanding with only 12% of typical training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained point cloud-text encoder
Expands text space with 6M descriptions
Zero-parameter cross-attention for alignment
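The zero-parameter idea is that attention pooling needs no learned projection matrices: the query can be derived from the tokens themselves, and scaled dot-product attention then collapses the token sequence into one vector. The sketch below is an illustrative guess at such a module, not the paper's exact formulation; the function name and the mean-token query are assumptions.

```python
import numpy as np

def zero_param_attention_pool(tokens: np.ndarray) -> np.ndarray:
    """Pool (n, d) token embeddings into one (d,) vector with
    parameter-free cross-attention: the mean token serves as the
    query, the tokens themselves as keys and values. No weights
    are learned, so the module adds zero trainable parameters."""
    n, d = tokens.shape
    query = tokens.mean(axis=0)              # (d,) query from the tokens themselves
    scores = tokens @ query / np.sqrt(d)     # (n,) scaled dot-product scores
    scores -= scores.max()                   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()                 # softmax over tokens
    return weights @ tokens                  # (d,) attention-weighted pooled vector

# Example: pool 16 hypothetical point-cloud tokens of dimension 32.
tokens = np.random.default_rng(0).normal(size=(16, 32))
pooled = zero_param_attention_pool(tokens)
```

Because nothing here is trained, such a pooling step can sit between a frozen point cloud-text encoder and the LLM without enlarging the alignment model.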
Yuan Tang
Huazhong University of Science and Technology
Xu Han
Huazhong University of Science and Technology
Xianzhi Li
Huazhong University of Science and Technology
3D Vision, Geometry Processing
Qiao Yu
Huazhong University of Science and Technology
Jinfeng Xu
Huazhong University of Science and Technology
Yixue Hao
Highly Cited Researcher, Associate Professor, Huazhong University of Science and Technology
Cognitive Computing, Edge Computing, Healthcare Big Data
Long Hu
Associate Professor of Computer Science, Huazhong University of Science and Technology
Edge Computing, Big Data, Affective Computing, Deep Reinforcement Learning
Min Chen
South China University of Technology