LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing social navigation research primarily emphasizes path efficiency and collision avoidance, overlooking the task intent and social norms encoded in natural language instructions. To address this gap, we propose LISN-Bench—the first simulation benchmark supporting language-instruction-driven social navigation—and introduce Social-Nav-Modulator, a hierarchical control framework. It employs a vision-language model (VLM) to implement a fast-slow dual-loop architecture, decoupling high-level semantic reasoning from low-level motion control. The system dynamically modulates cost maps and controller parameters to jointly satisfy task objectives, social constraints (e.g., personal space, no-go zones), and real-time obstacle avoidance. Evaluated on LISN-Bench, our method achieves a 91.3% average success rate—63 percentage points higher than the best baseline—with particularly strong performance on challenging tasks such as pedestrian following and strict adherence to restricted areas.

📝 Abstract
Toward human-robot coexistence, socially aware navigation is essential for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are necessary but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions with task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, more than 63 percentage points higher than the most competitive baseline, with most of the improvement observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark for language-instructed social navigation tasks
Proposes a hierarchical system using VLM to modulate robot navigation parameters
Enhances robot adaptability in dynamic environments with user instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-based controller modulating costmaps and parameters
Fast-slow hierarchical system decoupling action generation
Simulation benchmark incorporating instruction following and scene understanding
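
The fast-slow decoupling described above can be sketched as two loops sharing a parameter store: a slow loop standing in for the VLM agent that maps an instruction to costmap and controller updates, and a fast loop that consumes the latest parameters each control step. This is an illustrative sketch only, not the authors' implementation; the class name, parameter names, values, and the rule-based stand-in for the VLM call are all assumptions.

```python
import threading

class SocialNavModulator:
    """Minimal sketch of a fast-slow dual-loop controller.

    The slow loop (hypothetically, a VLM agent) updates shared
    costmap weights and controller parameters at low frequency;
    the fast loop reads the latest parameters to produce speed
    commands at high frequency. All values are illustrative.
    """

    def __init__(self):
        self._lock = threading.Lock()
        # Parameters the slow (VLM) loop modulates.
        self.params = {
            "personal_space_radius": 0.5,  # metres
            "forbidden_zone_cost": 100.0,  # costmap penalty
            "max_speed": 1.0,              # m/s
        }

    def slow_vlm_update(self, instruction: str) -> None:
        """Stand-in for a VLM call: map an instruction to parameter updates."""
        updates = {}
        if "keep distance" in instruction:
            updates["personal_space_radius"] = 1.2
        if "avoid" in instruction:
            updates["forbidden_zone_cost"] = 1000.0
        if "follow" in instruction:
            updates["max_speed"] = 0.8
        with self._lock:
            self.params.update(updates)

    def fast_control_step(self, dist_to_goal: float) -> float:
        """High-frequency step: compute a speed command from latest params."""
        with self._lock:
            max_speed = self.params["max_speed"]
        # Simple proportional slowdown near the goal.
        return min(max_speed, 0.5 * dist_to_goal)

# Usage: the slow update runs rarely (e.g. once per instruction or
# scene change), while fast_control_step runs every control cycle.
modulator = SocialNavModulator()
modulator.slow_vlm_update("follow the person and keep distance")
cmd = modulator.fast_control_step(3.0)  # capped at the modulated 0.8 m/s
```

The key design point the sketch reflects is that the fast loop never blocks on the slow loop: it always uses the most recent parameters, so motion control stays reactive even when VLM inference is slow.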