🤖 AI Summary
This work addresses natural-language-guided 3D object localization in real-world scenes for robots and AR devices. It presents LOCATE 3D, a model that localizes objects from referring expressions and operates directly on posed RGB-D sensor streams. At its core is 3D-JEPA, a self-supervised representation learning algorithm for sensor point clouds: it takes point clouds featurized with 2D foundation models (CLIP, DINO) and uses masked prediction in latent space as a pretext task to learn contextualized point features. The pretrained encoder is then finetuned together with a language-conditioned decoder that jointly predicts 3D masks and bounding boxes. The work also introduces LOCATE 3D DATASET, a 3D referential grounding dataset spanning multiple capture setups with over 130K annotations. LOCATE 3D achieves state-of-the-art results on standard referential grounding benchmarks and shows robust generalization.
📝 Abstract
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
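To make the masked latent-prediction pretext task more concrete, here is a minimal NumPy sketch of the general idea: a context branch sees a point cloud with some point features masked out, a target branch sees the full features, and the loss is a regression in latent space at the masked positions only. This is an illustration under stated assumptions, not the LOCATE 3D implementation. The toy linear encoders, the mean-pooled context, the zero mask token, and the exponential-moving-average (EMA) target update are all hypothetical choices borrowed from the generic JEPA recipe; the abstract does not specify any of them. Random vectors stand in for the lifted CLIP/DINO features.

```python
import numpy as np

def linear_encoder(features, weights):
    """Toy per-point encoder (assumption): one linear map followed by tanh."""
    return np.tanh(features @ weights)

def masked_latent_loss(features, mask, ctx_w, tgt_w, pred_w, mask_token):
    """Predict target latents of masked points from the visible context."""
    # Target branch sees the full, unmasked point features.
    targets = linear_encoder(features, tgt_w)
    # Context branch sees masked points replaced by a mask token.
    corrupted = np.where(mask[:, None], mask_token, features)
    context = linear_encoder(corrupted, ctx_w)
    # The toy encoder is per-point, so pool the visible latents to give
    # masked positions some scene-level context.
    pooled = context[~mask].mean(axis=0)
    predictions = (context + pooled) @ pred_w
    # Regress in latent space, only at the masked positions.
    diff = predictions[mask] - targets[mask]
    return float(np.mean(diff ** 2))

def ema_update(tgt_w, ctx_w, momentum=0.99):
    """Move target weights toward context weights (assumed EMA schedule)."""
    return momentum * tgt_w + (1.0 - momentum) * ctx_w

rng = np.random.default_rng(0)
num_points, feat_dim, latent_dim = 64, 16, 8

# Stand-in for per-point features lifted from 2D foundation models.
features = rng.normal(size=(num_points, feat_dim))

# Mask exactly half of the points.
mask = np.zeros(num_points, dtype=bool)
mask[rng.choice(num_points, size=num_points // 2, replace=False)] = True

ctx_w = rng.normal(scale=0.1, size=(feat_dim, latent_dim))
tgt_w = ctx_w.copy()          # target encoder starts as a copy of the context encoder
pred_w = np.eye(latent_dim)   # identity predictor as a placeholder
mask_token = np.zeros(feat_dim)

loss = masked_latent_loss(features, mask, ctx_w, tgt_w, pred_w, mask_token)
tgt_w = ema_update(tgt_w, ctx_w)
print(f"masked latent prediction loss: {loss:.4f}")
```

In a real training loop the context encoder and predictor would be updated by gradient descent on this loss, while the target branch receives no gradients. Predicting latents rather than raw inputs is what distinguishes this family of objectives from reconstruction-based masked autoencoding.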