PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current 3D multimodal large language models (MLLMs) suffer from three key limitations: scarcity of high-quality point-cloud instruction data; modality and domain gaps when transferring knowledge from 2D vision to 3D geometry; and coarse-grained, category-imbalanced language descriptions in existing benchmarks, which lead to inaccurate evaluation. To address these, the authors propose PiSA-Engine, a point-cloud self-augmentation data engine that combines annotation from 3D MLLMs with cross-validation from 2D MLLMs to generate spatially grounded, high-fidelity instruction data. They further introduce PointLLM-PiSA, a co-evolution training framework built on PointLLM, and PiSA-Bench, a fine-grained 3D evaluation benchmark covering six aspects with detailed, diverse labels. On PiSA-Bench, PointLLM-PiSA achieves 46.45% on zero-shot 3D object captioning and 63.75% on generative classification, improvements of 8.33 and 16.25 percentage points over the prior state of the art.

📝 Abstract
3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
Problem

Research questions and friction points this paper is trying to address.

Limited quantity and quality of 3D datasets hinder 3D MLLMs.
Modality and domain gaps exist in transferring 2D MLLM knowledge.
Existing 3D benchmarks lack detailed labels and category diversity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

PiSA-Engine generates point-language instruction datasets enriched with 3D spatial semantics.
Integrates off-the-shelf 2D and 3D MLLMs: 3D models annotate point clouds, 2D models cross-validate the captions.
Introduces PiSA-Bench, a six-aspect 3D benchmark with detailed and diverse labels.
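The self-augmentation cycle described in the abstract can be sketched as a simple generate-validate-filter loop. This is an illustrative sketch only: all function names and the scoring interface below are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the PiSA-Engine self-augmentation loop: a 3D MLLM
# annotates point clouds, a 2D MLLM cross-validates the captions against
# rendered views, and accepted pairs feed the next round of training.
# All names here are illustrative stand-ins, not the authors' API.

def annotate_3d(point_cloud):
    # Stand-in for a 3D MLLM (e.g. PointLLM) producing a candidate caption.
    return f"object described from {len(point_cloud)} points"

def cross_validate_2d(caption, rendered_views):
    # Stand-in for a 2D MLLM scoring caption/view agreement in [0, 1].
    # Here: trivially accept when rendered views are available.
    return 1.0 if rendered_views else 0.0

def pisa_round(samples, threshold=0.5):
    """One round of self-augmented data generation.

    samples: list of (point_cloud, rendered_views) pairs.
    Returns (point_cloud, caption) pairs that pass cross-validation,
    which would then be used to fine-tune the 3D MLLM before the
    next round of the co-evolution cycle.
    """
    accepted = []
    for point_cloud, views in samples:
        caption = annotate_3d(point_cloud)
        score = cross_validate_2d(caption, views)
        if score >= threshold:  # keep only cross-validated pairs
            accepted.append((point_cloud, caption))
    return accepted

# Toy example: one sample with rendered views, one without.
data = [([(0, 0, 0)] * 8, ["view_a.png"]), ([(1, 1, 1)] * 4, [])]
print(len(pisa_round(data)))  # the sample lacking views is filtered out
```

The key design point the abstract emphasizes is the complementarity: the 3D model supplies holistic point-cloud understanding for annotation, while the 2D model provides an independent check, so low-agreement captions never enter the training set.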