P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets two problems in adapting existing 3D vision-language models: full fine-tuning is computationally expensive, and conventional prompt tuning tends to overfit and therefore generalizes poorly. The authors propose P3T, which prompts both input modalities: a point-level prompter generates instance-aware prompts for the point cloud, and learnable textual prompts replace handcrafted templates. A prototypical loss additionally reduces intra-class variance and improves alignment in the embedding space. While tuning only a small number of parameters, P3T matches or surpasses full fine-tuning on classification and few-shot tasks and generalizes robustly in cross-dataset settings.
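As a rough illustration of what prompting at the input level can look like, the sketch below builds a text prompt from learnable context vectors and extends a point cloud with prompts derived from that same cloud. Everything here is an assumption made for illustration: the function names, the use of the centroid as the instance signal, and the choice to append prompt points rather than, say, offset existing ones are hypothetical and are not taken from the P3T code.

```python
def build_text_prompt(context, class_embedding):
    """Prepend shared learnable context vectors to one class-name embedding.

    Hypothetical stand-in for a hand-crafted template such as
    "a point cloud of a <class>": the context vectors would be trained.
    """
    return [list(v) for v in context] + [list(class_embedding)]


def point_prompts(points, weights):
    """Derive prompt points from the input cloud itself (instance-aware).

    Assumption: each prompt is the cloud centroid scaled per axis by one
    learnable weight row. The real prompter is likely a small network.
    """
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    return [[w[d] * centroid[d] for d in range(3)] for w in weights]


def prompt_point_cloud(points, weights):
    """Return the original points followed by the generated prompt points."""
    return [list(p) for p in points] + point_prompts(points, weights)


cloud = [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]
weights = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]   # learnable in practice
prompted_cloud = prompt_point_cloud(cloud, weights)

context = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # learnable in practice
text_prompt = build_text_prompt(context, [1.0, 0.0, 0.0, 0.0])
```

Because both prompts are built from the inputs, the frozen encoders themselves are left unchanged, which is the property the summary credits for preserving generalization.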

📝 Abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text in place of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
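The abstract does not give the form of the prototypical loss, so the following is only a minimal sketch of one common formulation, not the authors' definition. It assumes that each category prototype is the mean of the L2-normalized embeddings of that category, and that the loss is a softmax cross-entropy over temperature-scaled cosine similarities to the prototypes. The function names and the `temperature` parameter are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_prototypes(embeddings, labels):
    """Prototype per category: normalized mean of its normalized embeddings."""
    groups = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        groups[label].append(l2_normalize(emb))
    protos = {}
    for label, members in groups.items():
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        protos[label] = l2_normalize(mean)
    return protos


def prototypical_loss(embeddings, labels, temperature=0.1):
    """Mean cross-entropy of each embedding against all category prototypes.

    Assumed formulation: pulling an embedding toward its own prototype and
    away from the others shrinks intra-category variance.
    """
    protos = class_prototypes(embeddings, labels)
    classes = sorted(protos)
    total = 0.0
    for emb, label in zip(embeddings, labels):
        z = l2_normalize(emb)
        logits = [sum(a * b for a, b in zip(z, protos[c])) / temperature
                  for c in classes]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[classes.index(label)]
    return total / len(embeddings)


labels = [0, 0, 1, 1]
tight = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loose = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
tight_loss = prototypical_loss(tight, labels)
loose_loss = prototypical_loss(loose, labels)
```

In this toy setup the tightly clustered embeddings yield a lower loss than the spread-out ones, which is the behavior a variance-reducing objective should show.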

Problem

Research questions and friction points this paper is trying to address.

prompt tuning

3D vision-language models

generalization

overfitting

parameter-efficient adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Tuning

3D Vision-Language Models

Point Cloud