🤖 AI Summary
This work addresses the challenge of unsupervised few-shot domain adaptation for 3D point clouds from synthetic to real-world scenarios by proposing an efficient framework that avoids reliance on heavy trainable encoders. The method projects point clouds into multi-view depth maps and leverages a frozen CLIP backbone combined with a lightweight 3D geometric encoder to integrate linguistic priors with geometric cues. It further incorporates knowledge-driven prompt tuning, parameter-efficient fine-tuning, entropy-guided view sampling, and uncertainty-aware prototype alignment. As the first CLIP-based approach for few-shot unsupervised domain adaptation in 3D point clouds, the framework achieves accuracy improvements of 3%–16% over CLIP baselines and conventional methods on PointDA-10 and GraspNetPC-10, significantly enhancing cross-domain 3D perception performance.
📝 Abstract
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on the PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3%–16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Code is available at https://github.com/SarthakM320/CLIPoint3D.
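The entropy-guided view sampling idea described above can be sketched as follows: each depth-map view receives per-class logits, and the views whose predictive distributions have the lowest entropy (i.e., the most confident views) are kept. This is a minimal illustrative sketch, not the paper's implementation; the function name and the use of raw NumPy logits are assumptions made here for clarity.

```python
import numpy as np

def entropy_guided_view_sampling(view_logits: np.ndarray, k: int) -> np.ndarray:
    """Select the k most confident views by predictive entropy.

    view_logits: array of shape (num_views, num_classes) with raw
    classification logits for each projected depth-map view.
    Returns the indices of the k lowest-entropy (most confident) views.
    """
    # Numerically stable softmax over the class dimension, per view.
    z = view_logits - view_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Shannon entropy per view; lower entropy = more peaked = more confident.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return np.argsort(entropy)[:k]
```

For example, a view whose logits strongly favor one class would be selected ahead of a view with a near-uniform distribution over classes.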