3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses 3D representation learning's critical dependence on costly, manually annotated 3D scan data. The authors propose LAM3C, the first purely video-driven, self-supervised framework for 3D pre-training. LAM3C operates solely on unlabeled videos: it reconstructs point clouds from video frames to model geometric structure, introduces a noise-regularized reconstruction loss to improve feature robustness, and employs multi-level Laplacian-aware clustering coupled with Sinkhorn-Knopp optimization to enforce geometrically consistent feature learning. Crucially, LAM3C achieves strong 3D pre-training without any ground-truth 3D supervision. On indoor semantic and instance segmentation benchmarks, LAM3C consistently outperforms existing self-supervised methods, demonstrating that raw video is a scalable, effective source for large-scale 3D pre-training.
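The summary names Sinkhorn-Knopp optimization for enforcing balanced cluster assignments. The paper's exact formulation is not reproduced here, but the standard use of Sinkhorn-Knopp in self-supervised clustering (as popularized by SwAV-style methods) can be sketched as follows; the function name, iteration count, and temperature are illustrative assumptions:

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Turn point-to-cluster similarity logits into balanced soft assignments.

    scores: (N, K) array of logits for N points and K clusters.
    Alternating row/column normalization pushes the assignment matrix
    toward uniform cluster usage (each cluster receives ~N/K points).
    """
    Q = np.exp(scores / eps)  # temperature-scaled similarities
    Q /= Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # balance clusters: each column sums to 1/K
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # normalize points: each row sums to 1/N
        Q /= N
    return Q * N  # each row is now a soft assignment summing to 1
```

The balancing constraint prevents the degenerate solution where all features collapse into a single cluster, which is the usual motivation for pairing clustering-based pre-training with Sinkhorn-Knopp.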

📝 Abstract
Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from point clouds generated from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos are an abundant source of data for 3D self-supervised learning.
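The abstract describes the noise-regularized loss only at a high level: it enforces local geometric smoothness and feature stability under noisy point clouds. A minimal sketch of such a loss, assuming Gaussian point jitter for the stability term and k-nearest-neighbor feature agreement for the smoothness term (both are plausible readings, not the paper's confirmed formulation; `encoder`, `sigma`, and `k` are hypothetical names):

```python
import numpy as np

def noise_regularized_loss(points, feats, encoder, sigma=0.01, k=8):
    """Sketch of a stability + smoothness regularizer for point features.

    points:  (N, 3) point cloud
    feats:   (N, D) features, feats = encoder(points)
    encoder: callable mapping (N, 3) points to (N, D) features
    """
    # Stability: features of Gaussian-jittered points should match clean features.
    noisy = points + np.random.normal(0.0, sigma, points.shape)
    stability = np.mean((encoder(noisy) - feats) ** 2)

    # Smoothness: each point's feature should be close to its k nearest neighbors'.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]  # skip self (column 0)
    smoothness = np.mean((feats[:, None, :] - feats[nn]) ** 2)

    return stability + smoothness
```

Video-generated point clouds are noisier than sensor scans, which is presumably why the loss targets robustness to perturbation rather than exact geometric fidelity.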
Problem

Research questions and friction points this paper is trying to address.

Learning 3D representations from unlabeled videos without real 3D sensors
Creating a scalable pre-training dataset from video-generated point clouds
Improving 3D semantic and instance segmentation via self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning from video-generated point clouds
Noise-regularized loss for geometric smoothness and stability
Scalable pre-training without real 3D sensors or scans
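The "Laplacian-aware" clustering named in the summary is not specified in detail on this page. A common way to make clustering geometry-aware is to embed points via the low-frequency eigenvectors of a kNN-graph Laplacian before clustering; the sketch below (function name, graph construction, and the plain k-means stand-in are all illustrative assumptions, not the paper's method) shows that idea:

```python
import numpy as np

def laplacian_eigenmap_clusters(points, k=6, n_clusters=4, n_iters=20):
    """Cluster a point cloud using a spectral (Laplacian) embedding.

    Builds a symmetric kNN graph, forms the normalized graph Laplacian,
    and runs a simple k-means on its low-frequency eigenvectors so that
    clusters respect the cloud's geometric structure.
    """
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # k nearest neighbors, excluding self
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nn[i]] = 1.0
    W = np.maximum(W, W.T)                          # symmetrize adjacency
    deg = W.sum(axis=1)
    L = np.eye(n) - W / np.sqrt(deg[:, None] * deg[None, :])  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, 1:n_clusters + 1]                 # skip the trivial constant eigenvector

    # Plain k-means on the spectral embedding (stand-in for the paper's scheme).
    rng = np.random.default_rng(0)
    centers = emb[rng.choice(n, n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels
```

In the paper this kind of clustering is applied at multiple levels and combined with Sinkhorn-Knopp balancing; the sketch covers only the single-level spectral step.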