🤖 AI Summary
This work addresses the reliance of 3D vision-language understanding models on costly, dense 3D annotations. We propose a weakly supervised training paradigm requiring only 2D images, camera poses, and 2D supervision (or pseudo-labels). Our core method introduces differentiable rendering as a cross-modal bridge, using rendered reconstructions as latent variables to distill 2D supervision into 3D representations. The framework supports arbitrary architectures—including decoder-only models—and integrates seamlessly with self-supervised pretraining. To our knowledge, this is the first work to systematically incorporate rendering-based supervision into 3D vision-language joint modeling, unifying 2D–3D alignment, language grounding, and cross-scene generalization. On 3D vision-language grounding benchmarks, our approach significantly outperforms prior state-of-the-art methods and mainstream 3D pretraining approaches, achieving substantial gains in generalization across unseen scenes and domains.
📝 Abstract
We train a feedforward model for 3D vision-language understanding that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. This approach is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the model's outputs without placing unnecessary constraints on the network architecture (e.g., the approach works with decoder-only models). Training requires only images, camera poses, and 2D labels; we show that even the 2D labels can be removed by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network and then finetune it for 3D vision-language understanding tasks. Our method outperforms prior state-of-the-art approaches for 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: https://liftgs.github.io.
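The core training signal described above, predict in 3D, render, compare in 2D, can be illustrated with a toy sketch. This is not the paper's implementation: the real method renders full scene reconstructions with a differentiable renderer and backpropagates through it, whereas here a single 3D point, a toy orthographic "renderer", and finite-difference gradients stand in for all three.

```python
def render(point3d):
    """Toy 'differentiable renderer': orthographic projection to 2D (drops depth)."""
    x, y, z = point3d
    return (x, y)

def loss_2d(pred2d, target2d):
    """2D supervision only: squared error against a 2D label (or pseudo-label)."""
    return sum((p - t) ** 2 for p, t in zip(pred2d, target2d))

def grad_fd(point3d, target2d, eps=1e-6):
    """Finite-difference gradient of the 2D loss w.r.t. the 3D prediction,
    standing in for backprop through a real differentiable renderer."""
    base = loss_2d(render(point3d), target2d)
    grad = []
    for i in range(3):
        p = list(point3d)
        p[i] += eps
        grad.append((loss_2d(render(p), target2d) - base) / eps)
    return grad

# "Train" a 3D prediction so that its 2D rendering matches a 2D label;
# no 3D ground truth is ever used.
point = [0.0, 0.0, 1.0]      # initial 3D prediction (the latent reconstruction)
target = (0.5, -0.3)         # 2D label for one view
for _ in range(200):
    g = grad_fd(point, target)
    point = [p - 0.5 * gi for p, gi in zip(point, g)]

print(render(point))  # ≈ (0.5, -0.3): the rendering matches the 2D supervision
```

Note that the depth coordinate receives no gradient from this single view; in the full method, supervising renderings from multiple posed cameras is what constrains the 3D structure.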