🤖 AI Summary
This study investigates whether geometry-grounded visual features outperform purely visual semantic features when distilled into radiance fields, addressing three core questions: (1) does geometry grounding improve the geometric fidelity of distilled semantic features; (2) does it enhance spatial localization of semantic objects; and (3) does it boost radiance field inversion accuracy? To answer the third question, we propose SPINE, a novel framework for inverting a radiance field without an initial pose guess, which combines coarse inversion using distilled semantics (e.g., DINO/CLIP features in Gaussian splatting) with fine inversion via photometric optimization. Experiments show that while geometry-grounded features encode richer structural detail, they do not surpass visual-only features in pose estimation or object localization; instead, visual-only features prove more versatile across downstream tasks. Our key contribution is a systematic characterization of this functional divergence between visual and geometric features in distilled radiance fields, underscoring the need for geometry-grounding strategies that preserve the versatility of pretrained semantic features for open-vocabulary spatial understanding in robotics.
📝 Abstract
Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem particularly promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details than their visual-only counterparts. Second, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Third, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework, SPINE, for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.
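The coarse-to-fine inversion pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the candidate poses, feature vectors, and `photometric_loss` callback are hypothetical stand-ins, cosine similarity stands in for semantic matching against the distilled field, and finite-difference gradient descent stands in for the photometric optimizer.

```python
import numpy as np

def coarse_inversion(query_feat, cand_feats, cand_poses):
    """Coarse stage (illustrative): choose the candidate pose whose
    distilled semantic feature best matches the query image's feature,
    scored by cosine similarity. No initial pose guess is required."""
    sims = cand_feats @ query_feat / (
        np.linalg.norm(cand_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8
    )
    return cand_poses[np.argmax(sims)].copy()

def fine_inversion(pose, photometric_loss, lr=0.1, steps=200, eps=1e-4):
    """Fine stage (illustrative): refine the coarse pose by gradient
    descent on a photometric loss; central finite differences stand in
    for autodiff through a differentiable renderer."""
    pose = pose.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            d = np.zeros_like(pose)
            d[i] = eps
            grad[i] = (photometric_loss(pose + d) - photometric_loss(pose - d)) / (2 * eps)
        pose -= lr * grad
    return pose

# Toy usage: three candidate poses with associated semantic features,
# and a quadratic stand-in for the photometric loss.
target = np.array([1.0, 2.0, 3.0])
loss = lambda p: float(np.sum((p - target) ** 2))
cand_poses = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 2.5], [5.0, 5.0, 5.0]])
cand_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_feat = np.array([0.1, 1.0])

coarse = coarse_inversion(query_feat, cand_feats, cand_poses)
fine = fine_inversion(coarse, loss)
```

The key design point mirrored here is the division of labor: the semantic stage only needs to land in the right basin of attraction, after which purely photometric optimization handles precision.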