Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

📅 2025-10-19
🏛️ IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling household service robots to understand scenes with open-vocabulary semantics for natural human-robot interaction, a task hindered by the fixed-category constraints of conventional semantic segmentation, which impede flexible text-based queries and 3D mapping. To overcome this limitation, the authors propose DVEFormer, an efficient RGB-D Transformer-based model that learns pixel-wise dense visual embeddings via knowledge distillation from an Alpha-CLIP teacher network. DVEFormer unifies linear-probe segmentation, open-vocabulary natural-language querying, and real-time 3D semantic mapping within a single framework. Evaluated on common indoor datasets, the model maintains competitive segmentation accuracy while running in real time (up to 77.0 FPS for its smaller variant), thereby moving beyond the closed-set category barrier inherent in traditional segmentation paradigms.
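The summary describes distilling pixel-wise embeddings from an Alpha-CLIP teacher into the student model. A minimal sketch of such a pixel-wise distillation objective, assuming a cosine-distance loss over per-pixel embedding maps (the function name and exact loss form are illustrative, not taken from the paper):

```python
import numpy as np

def cosine_distill_loss(student, teacher, eps=1e-8):
    """Pixel-wise cosine-distance loss between a student and a teacher
    embedding map, both of shape (H, W, D). Illustrative sketch only;
    the paper's actual training objective may differ."""
    # L2-normalize each pixel's embedding vector
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    # 1 - cosine similarity, averaged over all pixels
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# identical maps give (near-)zero loss; disjoint directions give loss 1
x = np.random.rand(4, 4, 8)
print(cosine_distill_loss(x, x))
```

Minimizing this per-pixel cosine distance pulls every student embedding toward the direction of the corresponding teacher embedding, which is what keeps the student's dense features aligned with the teacher's text-aligned embedding space.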

📝 Abstract
In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer – an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
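The abstract's text-based querying amounts to scoring each pixel's dense embedding against an embedded text query. A minimal sketch, assuming cosine similarity between a (H, W, D) embedding map and a single (D,) query vector; in the paper the query vector would come from the CLIP text encoder, here it is just a hand-made unit vector:

```python
import numpy as np

def query_heatmap(dense_emb, text_emb, eps=1e-8):
    """Score each pixel of a (H, W, D) dense embedding map against a
    (D,) text embedding via cosine similarity, returning an (H, W)
    heatmap. Illustrative interface, not the paper's exact API."""
    d = dense_emb / (np.linalg.norm(dense_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb) + eps)
    return d @ t

# toy 2x2 "image": one pixel matches the query direction, one does not
emb = np.zeros((2, 2, 3))
emb[0, 0] = [1.0, 0.0, 0.0]
emb[1, 1] = [0.0, 1.0, 0.0]
heat = query_heatmap(emb, np.array([1.0, 0.0, 0.0]))
print(heat[0, 0], heat[1, 1])  # near 1.0 for the match, near 0.0 otherwise
```

Thresholding or arg-maxing such heatmaps over a set of class-name queries is also how a linear-probe-style segmentation can be recovered from the same embeddings, which is why the method can act as a drop-in replacement for a fixed-class segmenter.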
Problem

Research questions and friction points this paper is trying to address.

dense visual embeddings
semantic segmentation
RGB-D perception
natural-language querying
mobile robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense Visual Embeddings
Knowledge Distillation
RGB-D Transformer
Text-aligned Embeddings
Real-time Semantic Understanding
Söhnke Benedikt Fischedick
Neuroinformatics and Cognitive Robotics Lab, Technische Universität Ilmenau, 98693 Ilmenau, Germany
Daniel Seichter
Neuroinformatics and Cognitive Robotics Lab, Technische Universität Ilmenau, 98693 Ilmenau, Germany
Benedict Stephan
Neuroinformatics and Cognitive Robotics Lab, Technische Universität Ilmenau, 98693 Ilmenau, Germany
Robin Schmidt
Neuroinformatics and Cognitive Robotics Lab, Technische Universität Ilmenau, 98693 Ilmenau, Germany
Horst-Michael Gross
Full Professor of Computer Science, Technische Universitaet Ilmenau
Robotics · Cognitive Robotics · Neural Networks · Deep Learning · Human-Robot Interaction