Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses natural-language-guided 3D object localization in real-world scenes for robots and AR devices. The model, LOCATE 3D, builds on 3D-JEPA, a self-supervised representation learning algorithm for sensor point clouds: point clouds are first featurized with 2D foundation models (CLIP, DINO), then masked prediction in latent space serves as the pretext task for learning contextualized point cloud features. The pretrained 3D-JEPA encoder is fine-tuned alongside a language-conditioned decoder that jointly predicts 3D masks and bounding boxes. To support this, the authors introduce LOCATE 3D DATASET, a large-scale 3D referring expression dataset spanning multiple capture setups with over 130K annotations. LOCATE 3D sets a new state of the art on standard referential grounding benchmarks, shows robust cross-scene generalization, and operates directly on posed RGB-D sensor streams, enabling real-world deployment on robots and AR devices.

📝 Abstract
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
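The masked latent-space prediction pretext described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical: the paper's actual architecture is a learned 3D encoder, not the toy linear maps used here, and all shapes and hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "featurized point cloud": N points, each carrying a feature vector
# lifted from 2D foundation models (stand-in for CLIP/DINO features).
N, F, D = 256, 32, 16                  # points, input feature dim, latent dim
points = rng.normal(size=(N, F))

# Toy linear encoders (hypothetical; the real model is far richer).
W_ctx = rng.normal(size=(F, D)) * 0.1  # context (online) encoder
W_tgt = W_ctx.copy()                   # target encoder (EMA copy, no gradients)
W_pred = np.eye(D)                     # predictor head

mask = rng.random(N) < 0.4             # mask ~40% of the points

# Target latents come from the *full* cloud via the target encoder.
target_latents = points @ W_tgt

# The context encoder only sees visible points; masked slots are zeroed.
visible = points.copy()
visible[mask] = 0.0
context_latents = visible @ W_ctx

# Predict the latents of masked points from the visible context,
# and score the prediction in latent space (JEPA-style), not pixel space.
pred = context_latents @ W_pred
loss = np.mean((pred[mask] - target_latents[mask]) ** 2)

# Momentum (EMA) update of the target encoder.
momentum = 0.99
W_tgt = momentum * W_tgt + (1 - momentum) * W_ctx
print(f"masked-prediction loss: {loss:.4f}")
```

The key design choice this sketch mirrors is that the reconstruction target is a latent representation rather than raw points, which avoids wasting capacity on low-level geometric noise.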
Problem

Research questions and friction points this paper is trying to address.

Localizing objects in 3D scenes from referring expressions
Enabling real-world deployment on robots and AR devices
Introducing a new dataset for 3D referential grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised 3D-JEPA for pointcloud learning
2D foundation models for 3D pointcloud featurization
Joint 3D mask and bounding box prediction
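The second innovation above, featurizing a 3D point cloud with 2D foundation models, can be sketched as projecting each 3D point into a posed camera frame and sampling the corresponding pixel's feature vector. This is a hypothetical illustration: the intrinsics, nearest-pixel sampling, and `lift_features` helper are all invented here, not taken from the paper.

```python
import numpy as np

def lift_features(points_cam, feat_map, K):
    """Lift dense 2D features onto 3D points (hypothetical sketch).

    points_cam: (N, 3) points in the camera frame (z > 0).
    feat_map:   (H, W, C) per-pixel 2D features (e.g. from CLIP/DINO).
    K:          (3, 3) pinhole intrinsics.
    Returns (N, C) per-point features via nearest-pixel sampling.
    """
    H, W, _ = feat_map.shape
    uvw = points_cam @ K.T                    # project with intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]             # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]

# Toy example with made-up intrinsics and a random feature map.
rng = np.random.default_rng(0)
K = np.array([[100.0,   0.0, 32.0],
              [  0.0, 100.0, 32.0],
              [  0.0,   0.0,  1.0]])
feat_map = rng.normal(size=(64, 64, 8))
points_cam = np.stack([rng.uniform(-0.2, 0.2, 100),   # x
                       rng.uniform(-0.2, 0.2, 100),   # y
                       rng.uniform(1.0, 3.0, 100)],   # depth z
                      axis=1)
point_feats = lift_features(points_cam, feat_map, K)
print(point_feats.shape)  # (100, 8)
```

In a multi-view setting one would aggregate features across all frames that observe a point (e.g. by averaging), but the single-view case above captures the core lifting step.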