Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited 3D geometric awareness of existing 2D vision foundation models, which hinders their performance on downstream tasks requiring spatial understanding. The authors propose a feedforward knowledge distillation framework in which a teacher model explicitly lifts 2D features into 3D representations via rapid 3D Gaussian reconstruction and generates supervision signals through multi-view projections to guide a lightweight student model in acquiring 3D-aware capabilities. By avoiding conventional per-scene optimization and eliminating feature averaging artifacts, the approach ensures consistent co-improvement between teacher and student. Experiments demonstrate that the resulting model significantly outperforms current methods on monocular depth estimation, surface normal prediction, multi-view correspondence, and semantic segmentation, while simultaneously enhancing both 3D geometric understanding and 2D semantic representation.
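The pipeline described above (lift teacher features to 3D Gaussians, splat them onto novel views, supervise the student with the splatted maps) can be caricatured in a few lines. Everything here is an illustrative stand-in, not the paper's implementation: the "lift" keeps one Gaussian per pixel, and the projection weights that splatting would derive from camera geometry are replaced by a random toy matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all names and shapes are illustrative, not the paper's API):
# teacher features for N source views, each H x W with C channels.
N, H, W, C = 2, 8, 8, 16
teacher_feats = rng.standard_normal((N, H, W, C))

def lift_to_gaussians(feats):
    """Hypothetical feed-forward lift: one 'Gaussian' per source pixel,
    carrying that pixel's teacher feature as its payload."""
    return feats.reshape(-1, feats.shape[-1])            # (N*H*W, C)

def splat_to_view(gaussian_feats, weights):
    """Hypothetical splatting: alpha-style weighted blend of the Gaussians
    that project onto each pixel of the novel view."""
    w = weights / weights.sum(axis=-1, keepdims=True)    # normalize per pixel
    return w @ gaussian_feats                            # (H*W, C)

gaussians = lift_to_gaussians(teacher_feats)
proj_weights = rng.random((H * W, gaussians.shape[0]))   # toy projection weights
target = splat_to_view(gaussians, proj_weights)          # 3D-consistent target map

student_pred = rng.standard_normal((H * W, C))           # student features at the novel view
distill_loss = np.mean((student_pred - target) ** 2)     # per-pixel feature distillation
print(distill_loss >= 0.0)                               # prints True
```

Because the lift is feed-forward, the splatted targets are recomputed from the current teacher features at every step, which is what lets teacher consistency and student quality improve together rather than freezing per-scene averaged features.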

📝 Abstract
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing the slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page: https://davidshavin4.github.io/Splat-and-Distill/
Problem

Research questions and friction points this paper is trying to address.

3D awareness
Vision Foundation Models
3D reconstruction
feature distillation
monocular depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware distillation
feed-forward 3D reconstruction
feature lifting
Gaussian splatting
vision foundation models
David Shavin
Department of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel
Sagie Benaim
Assistant Professor, Hebrew University of Jerusalem
Computer Vision · Machine Learning