OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
In high-density human-robot cohabitation scenarios (e.g., hospitals, nursing homes), service robots must interpret evolving social norms and generate socially compliant navigation trajectories in real time, yet existing approaches struggle to balance online adaptability with lightweight deployment. This paper introduces OLiVia-Nav, a novel online lifelong vision-language social navigation framework. Its core innovation is Social Context Contrastive Language Image Pre-training (SC-CLIP), a distillation approach that transfers the social reasoning capabilities of large vision-language models (VLMs) to a lightweight student model while enabling continual online learning and policy refinement for novel social contexts. The framework integrates social-context encoding, multi-objective trajectory generation, and a trajectory selection mechanism. Real-world experiments demonstrate that OLiVia-Nav significantly reduces trajectory MSE, Hausdorff distance, and the duration of personal space violations compared with state-of-the-art DRL- and VLM-based methods. Ablation studies confirm the efficacy of each component.
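The summary describes SC-CLIP as a contrastive distillation from a large VLM teacher into a lightweight student. The paper's exact objective is not given here, but a generic CLIP-style contrastive distillation loss can be sketched as follows. This is a minimal illustration under assumed details: `clip_distill_loss`, the symmetric InfoNCE formulation with matched student/teacher pairs as positives, and the temperature default are all hypothetical, not taken from the paper.

```python
import numpy as np

def clip_distill_loss(student_emb, teacher_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss aligning student embeddings
    with frozen teacher embeddings. Row i of each matrix is a matched
    (positive) pair; all other rows in the batch act as negatives.
    NOTE: this is an illustrative sketch, not the paper's SC-CLIP loss."""
    # L2-normalise so dot products are cosine similarities
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(s))              # positives sit on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal (positive) entry
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the student->teacher and teacher->student directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the student reproduces the teacher's embeddings exactly, the diagonal dominates every row and the loss is near its minimum; mismatched batches score higher, which is the gradient signal a distillation loop of this kind would use.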

πŸ“ Abstract
Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfort of the people they share the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select socially compliant robot trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the robot trajectory planning over time as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for OLiVia-Nav.
Problem

Research questions and friction points this paper is trying to address.

Enhance robot navigation in human-centered environments.
Adapt to new social scenarios during navigation.
Improve social compliance in robot trajectory planning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates vision-language models with lifelong learning
Uses SC-CLIP for social context encoding
Generates socially compliant robot trajectories
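The last two points describe encoding the social context with SC-CLIP and then generating and selecting compliant trajectories. As a rough illustration of that selection step, the sketch below scores candidate trajectories against the encoded social context by cosine similarity and keeps the best one. The function name `select_trajectory`, the cosine-similarity scoring rule, and the `traj_encoder` interface are assumptions for illustration, not the paper's actual mechanism.

```python
import numpy as np

def select_trajectory(candidates, context_emb, traj_encoder):
    """Hypothetical selection step: embed each candidate trajectory and
    return the one whose embedding best matches the social-context
    embedding (cosine similarity). Illustrative only, not OLiVia-Nav's
    actual selection criterion."""
    c = context_emb / np.linalg.norm(context_emb)
    best, best_score = None, -np.inf
    for traj in candidates:
        e = traj_encoder(traj)               # assumed trajectory encoder
        e = e / np.linalg.norm(e)
        score = float(c @ e)                 # cosine similarity to context
        if score > best_score:
            best, best_score = traj, score
    return best, best_score

# Toy usage with a stand-in encoder (mean over waypoints):
mean_encoder = lambda tr: tr.mean(axis=0)
context = np.array([1.0, 0.0])
traj_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # heads along the context
traj_b = np.array([[0.0, 1.0], [0.0, 1.0]])  # orthogonal to the context
chosen, score = select_trajectory([traj_b, traj_a], context, mean_encoder)
```

In a real pipeline the encoder would be the distilled lightweight VLM and the candidates would come from the multi-objective trajectory generator; the point of the sketch is only the argmax-over-similarity structure.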