🤖 AI Summary
This work addresses the reliance of 3D vision-language understanding models on costly, dense 3D annotations. We propose a weakly supervised training paradigm requiring only 2D images, camera poses, and 2D supervision (or pseudo-labels). Our core method introduces differentiable rendering as a cross-modal bridge, using rendered reconstructions as latent variables to distill 2D supervision into 3D representations. The framework supports arbitrary architectures—including decoder-only models—and integrates seamlessly with self-supervised pretraining. To our knowledge, this is the first work to systematically incorporate rendering-based supervision into 3D vision-language joint modeling, unifying 2D–3D alignment, language grounding, and cross-scene generalization. On 3D vision-language grounding benchmarks, our approach significantly outperforms prior state-of-the-art methods and mainstream 3D pretraining approaches, achieving substantial gains in generalization across unseen scenes and domains.
📝 Abstract
We train a feedforward model for 3D vision-language understanding that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. This approach is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the model's outputs without placing unnecessary constraints on the network architecture (e.g., the approach works with decoder-only models). Training requires only images, camera poses, and 2D labels; we show that even the 2D labels can be removed by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network and then finetune it for 3D vision-language understanding tasks. Our method outperforms prior state-of-the-art approaches for 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: https://liftgs.github.io.
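The core training signal described above, predict in 3D, render, compare in 2D, can be illustrated with a toy sketch. This is not the paper's implementation: the real method renders full scene reconstructions with a differentiable renderer and backpropagates through it, whereas here a single 3D point, a toy orthographic "renderer", and finite-difference gradients stand in for all three.

```python
def render(point3d):
    """Toy 'differentiable renderer': orthographic projection to 2D (drops depth)."""
    x, y, z = point3d
    return (x, y)

def loss_2d(pred2d, target2d):
    """2D supervision only: squared error against a 2D label (or pseudo-label)."""
    return sum((p - t) ** 2 for p, t in zip(pred2d, target2d))

def grad_fd(point3d, target2d, eps=1e-6):
    """Finite-difference gradient of the 2D loss w.r.t. the 3D prediction,
    standing in for backprop through a real differentiable renderer."""
    base = loss_2d(render(point3d), target2d)
    grad = []
    for i in range(3):
        p = list(point3d)
        p[i] += eps
        grad.append((loss_2d(render(p), target2d) - base) / eps)
    return grad

# "Train" a 3D prediction so that its 2D rendering matches a 2D label;
# no 3D ground truth is ever used.
point = [0.0, 0.0, 1.0]      # initial 3D prediction (the latent reconstruction)
target = (0.5, -0.3)         # 2D label for one view
for _ in range(200):
    g = grad_fd(point, target)
    point = [p - 0.5 * gi for p, gi in zip(point, g)]

print(render(point))  # ≈ (0.5, -0.3): the rendering matches the 2D supervision
```

Note that the depth coordinate receives no gradient from this single view; in the full method, supervising renderings from multiple posed cameras is what constrains the 3D structure.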