🤖 AI Summary
This study investigates whether geometry-grounded visual features outperform purely visual semantic features when distilled into radiance fields, addressing three core questions: (1) does geometry grounding improve the geometric fidelity of distilled semantic features; (2) does it enhance spatial localization of semantic objects; and (3) does it boost radiance field inversion accuracy? To answer the third question, we propose SPINE, a novel framework for inverting a radiance field without an initial pose guess, which combines coarse inversion using distilled semantics (e.g., DINO/CLIP features in Gaussian splatting) with fine inversion via photometric optimization. Experiments show that while geometry-grounded features encode richer structural detail, they do not surpass visual-only features in pose estimation or object localization; instead, visual-only features prove more versatile across downstream tasks. Our key contribution is a systematic characterization of this functional divergence between visual and geometric features in distilled radiance fields, underscoring the need for geometry-grounding strategies that preserve the versatility of pretrained semantic features for open-vocabulary spatial understanding in robotics.
📝 Abstract
Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem particularly promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details than their visual-only counterparts. Second, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Third, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework, SPINE, for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.
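The coarse-to-fine inversion pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the candidate poses, feature vectors, and `photometric_loss` callback are hypothetical stand-ins, cosine similarity stands in for semantic matching against the distilled field, and finite-difference gradient descent stands in for the photometric optimizer.

```python
import numpy as np

def coarse_inversion(query_feat, cand_feats, cand_poses):
    """Coarse stage (illustrative): choose the candidate pose whose
    distilled semantic feature best matches the query image's feature,
    scored by cosine similarity. No initial pose guess is required."""
    sims = cand_feats @ query_feat / (
        np.linalg.norm(cand_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8
    )
    return cand_poses[np.argmax(sims)].copy()

def fine_inversion(pose, photometric_loss, lr=0.1, steps=200, eps=1e-4):
    """Fine stage (illustrative): refine the coarse pose by gradient
    descent on a photometric loss; central finite differences stand in
    for autodiff through a differentiable renderer."""
    pose = pose.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            d = np.zeros_like(pose)
            d[i] = eps
            grad[i] = (photometric_loss(pose + d) - photometric_loss(pose - d)) / (2 * eps)
        pose -= lr * grad
    return pose

# Toy usage: three candidate poses with associated semantic features,
# and a quadratic stand-in for the photometric loss.
target = np.array([1.0, 2.0, 3.0])
loss = lambda p: float(np.sum((p - target) ** 2))
cand_poses = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 2.5], [5.0, 5.0, 5.0]])
cand_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_feat = np.array([0.1, 1.0])

coarse = coarse_inversion(query_feat, cand_feats, cand_poses)
fine = fine_inversion(coarse, loss)
```

The key design point mirrored here is the division of labor: the semantic stage only needs to land in the right basin of attraction, after which purely photometric optimization handles precision.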