Masked Clustering Prediction for Unsupervised Point Cloud Pre-training

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the sparsity and insufficient discriminability of semantic features in unsupervised pre-training of 3D point clouds, this paper proposes MaskClu—the first framework to integrate clustering learning into a point cloud masked autoencoding architecture. MaskClu jointly optimizes three objectives: predicting cluster assignments for masked regions, regressing cluster centers, and performing global instance-level contrastive learning. By enforcing dense clustering-based reconstruction, the model learns fine-grained semantic structures, while contrastive learning enhances instance-level representation discriminability. Extensive experiments demonstrate state-of-the-art performance across diverse downstream tasks—including part segmentation, semantic segmentation, object detection, and classification—on major 3D point cloud benchmarks. Notably, MaskClu significantly improves the semantic representational capacity and generalization ability of Vision Transformers (ViTs) for 3D point cloud understanding.

📝 Abstract
Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method on multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu achieves new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.
Problem

Research questions and friction points this paper is trying to address.

Learning dense semantic features from point clouds using ViTs
Integrating masked modeling with clustering for unsupervised pre-training
Enhancing instance-level feature learning via global contrastive mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines masked modeling with clustering-based learning
Introduces global contrastive learning mechanism
Jointly optimizes dense semantic and instance-level learning
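The three objectives above can be sketched as a single joint loss. The following is a minimal NumPy toy, not the authors' implementation: the function name `maskclu_loss`, the equal loss weights, and all tensor shapes are assumptions for illustration. It combines (1) cross-entropy on predicted cluster assignments for masked points, (2) MSE regression of cluster centers, and (3) an InfoNCE-style contrastive term between two masked views of each instance.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def maskclu_loss(assign_logits, assign_targets, center_preds, center_targets,
                 global_a, global_b, temperature=0.1):
    """Toy combination of the three MaskClu-style objectives (hypothetical weighting)."""
    # 1) Dense cluster-assignment prediction: cross-entropy over K clusters
    #    for each masked point.
    probs = softmax(assign_logits)
    n = assign_logits.shape[0]
    ce = -np.log(probs[np.arange(n), assign_targets] + 1e-9).mean()

    # 2) Cluster-center regression: MSE between predicted and target centers.
    mse = ((center_preds - center_targets) ** 2).mean()

    # 3) Global instance-level contrastive term (InfoNCE): two masked views of
    #    the same cloud are positives; other instances in the batch are negatives.
    a = global_a / np.linalg.norm(global_a, axis=1, keepdims=True)
    b = global_b / np.linalg.norm(global_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(a.shape[0])
    nce = -log_probs[idx, idx].mean()

    return ce + mse + nce  # equal weights: an assumption, not from the paper

rng = np.random.default_rng(0)
loss = maskclu_loss(
    rng.normal(size=(32, 8)),     # logits for 32 masked points over 8 clusters
    rng.integers(0, 8, size=32),  # pseudo cluster labels from clustering
    rng.normal(size=(8, 3)),      # predicted cluster centers (xyz)
    rng.normal(size=(8, 3)),      # target cluster centers (xyz)
    rng.normal(size=(4, 16)),     # global embeddings, masked view A (batch of 4)
    rng.normal(size=(4, 16)),     # global embeddings, masked view B
)
print(float(loss))
```

In a real pipeline the pseudo labels and target centers would come from clustering the point features (e.g. with k-means), and each term would typically carry its own tunable weight.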