🤖 AI Summary
Existing CLIP-based 3D scene understanding methods rely on k-NN or radius-based tokenization, making them susceptible to dataset-specific spatial scale biases and limiting generalization. This paper proposes S4Token, the first unsupervised, scale-invariant 3D tokenization framework tailored for CLIP, integrating superpoint grouping, coordinate-scale normalization, and semantic-aware feature propagation. The authors introduce a joint training paradigm comprising masked point modeling, a contrastive clustering objective, and multi-view image–3D cross-modal distillation, enabling self-supervised representation learning with a frozen CLIP backbone. Evaluated on the ScanNet and S3DIS benchmarks, S4Token achieves +12.6% average accuracy gains in zero-shot 3D retrieval and open-vocabulary segmentation over conventional tokenization approaches, demonstrating superior cross-domain generalization.
📝 Abstract
Vision-language models like CLIP offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest-neighbor or radius-based tokenization, struggle with cross-domain generalization due to their sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. Through extensive experimental analysis, we show that combining superpoint-based grouping with coordinate-scale normalization consistently outperforms conventional methods. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
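The core idea in the abstract — normalize away scene scale before grouping points into tokens — can be illustrated with a minimal sketch. The code below is not the authors' implementation: `normalize_scale` and `pool_tokens` are illustrative names, and the hard partition in `assignments` stands in for the paper's superpoint grouping (any unsupervised partition would slot in the same way). The point is that two copies of a scene at different metric scales map to the same normalized coordinates, so downstream tokenization sees scale-invariant input.

```python
import numpy as np

def normalize_scale(points):
    """Center a point cloud and rescale it to unit extent.

    A minimal stand-in for coordinate-scale normalization: after this
    step, the same scene sampled at different metric scales maps to
    (approximately) the same normalized coordinates.
    """
    centered = points - points.mean(axis=0, keepdims=True)
    scale = np.linalg.norm(centered, axis=1).max()  # max distance from centroid
    return centered / (scale + 1e-8)

def pool_tokens(points, feats, assignments, num_groups):
    """Average per-point features within each group to form one token per group.

    `assignments` is a hypothetical hard partition (superpoint-like);
    the actual grouping in the paper is superpoint-based.
    """
    dim = feats.shape[1]
    tokens = np.zeros((num_groups, dim))
    counts = np.zeros(num_groups)
    np.add.at(tokens, assignments, feats)   # unbuffered scatter-add per group
    np.add.at(counts, assignments, 1.0)
    return tokens / np.maximum(counts, 1.0)[:, None]

# Toy usage: the same scene at 1x and 40x metric scale normalizes to the
# same coordinates, so the pooled tokens are identical.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
feats = rng.normal(size=(100, 16))
groups = rng.integers(0, 8, size=100)

small = normalize_scale(pts)
large = normalize_scale(pts * 40.0)  # same scene, 40x spatial scale
assert np.allclose(small, large)
```

Note the design choice this sketch makes explicit: scale invariance is handled once, at the coordinate level, rather than baked into the grouping rule — which is why k-NN or radius-based tokenizers, whose neighborhoods depend directly on metric scale, inherit dataset-specific biases.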