Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of achieving both real-time performance and social awareness in robot navigation through dynamic environments, this paper proposes Vision-Language Attention Distillation (Vi-LAD). Vi-LAD introduces an attention-map-level knowledge distillation framework that transfers high-level social-cognition capabilities from large vision-language models (VLMs) to a lightweight Transformer-based navigation policy. It fuses intermediate-layer attention representations from a pre-trained vision-action model and a VLM to generate interpretable, socially aware traversability cost maps, which are integrated with a socially aware model predictive control (MPC) planner for end-to-end real-time navigation. Evaluated on a Husky robot platform, Vi-LAD achieves a 14.2%–50% improvement in navigation success rate over state-of-the-art methods, significantly enhancing safety and motion smoothness in complex human-robot coexistence scenarios.

📝 Abstract
We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then utilized as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show a 14.2%–50% improvement in success rate, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.
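The abstract describes fusing attention maps from a pre-trained vision-action model with attention-like semantic maps from a VLM, then supervising the student transformer's attention against the fused target. The paper's actual loss is not reproduced here; the sketch below is a minimal illustration of the general idea, assuming a simple convex-combination fusion (`alpha` is a hypothetical weighting, not a parameter from the paper) and a KL-divergence-style attention-matching loss. Function names (`fuse_teacher_maps`, `attention_distillation_loss`) are placeholders for illustration only.

```python
import numpy as np

def fuse_teacher_maps(va_attn, vlm_sem, alpha=0.5):
    """Fuse a vision-action attention map with a VLM semantic map.

    A simple convex combination, normalized to sum to 1 so the result
    can be treated as a spatial attention distribution. The fusion rule
    and alpha are illustrative assumptions, not the paper's formulation.
    """
    fused = alpha * vlm_sem + (1.0 - alpha) * va_attn
    return fused / (fused.sum() + 1e-8)

def attention_distillation_loss(student_attn, fused_target):
    """KL divergence between the fused teacher target and the student's
    (normalized) attention map -- one common choice for attention-level
    distillation losses."""
    p = fused_target.ravel()
    q = student_attn.ravel() / (student_attn.sum() + 1e-8)
    return float(np.sum(p * np.log((p + 1e-8) / (q + 1e-8))))
```

In practice such a loss would be computed per attention head at selected intermediate layers and backpropagated through the student transformer; the numpy version above only demonstrates the map-fusion and matching step.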
Problem

Research questions and friction points this paper is trying to address.

Distilling socially compliant navigation knowledge from Vision-Language Models.
Enhancing robot navigation using attention maps for socially aware motion planning.
Improving real-time robotic navigation efficiency and success rates in dynamic environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills navigation knowledge from Vision-Language Model
Uses attention maps for socially aware motion planning
Fine-tunes transformer model with novel distillation loss
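The refined attention maps are used as a traversability costmap inside a socially aware MPC planner. As a rough illustration of that pipeline stage, the sketch below inverts an attention map into a cost grid (high attention = preferred, therefore low cost; this mapping is an assumption) and scores candidate discrete paths by accumulated cost, as a sampling-based MPC might when ranking rollouts. All names here are hypothetical and not from the paper.

```python
import numpy as np

def costmap_from_attention(attn):
    """Map a spatial attention map to a traversability costmap.

    Assumes higher attention marks socially preferred regions, so cost
    is the min-max-normalized complement of attention.
    """
    a = (attn - attn.min()) / (np.ptp(attn) + 1e-8)
    return 1.0 - a

def rollout_cost(costmap, path):
    """Accumulate cell costs along a discrete path of (row, col) cells."""
    return float(sum(costmap[r, c] for r, c in path))

def pick_best_path(costmap, candidates):
    """Select the candidate rollout with the lowest accumulated cost,
    mimicking the argmin step of a sampling-based MPC."""
    return min(candidates, key=lambda p: rollout_cost(costmap, p))
```

A real MPC would additionally penalize control effort, goal distance, and dynamic constraints over a receding horizon; this fragment isolates only the costmap-scoring role that the distilled attention maps play.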