🤖 AI Summary
In physical human-robot interaction (pHRI), robots rarely express their intent in natural language, which impedes user comprehension and collaborative efficiency. To address this, we propose CoRI, the first task-agnostic, vision-language-driven intent communication framework. Our method integrates human pose estimation with visual encoding of the planned 3D trajectory, leveraging a multimodal vision-language model (VLM) to jointly infer high-level intent, motion dynamics, and the collaborative actions required of the user, generating context-adaptive natural language descriptions. Unlike prior approaches that rely on handcrafted rules or task-specific templates, ours enables end-to-end, cross-task generalizable intent-language alignment. Evaluated on real-world assistive tasks (feeding, bathing, and shaving), our framework significantly improves communication clarity (p < 0.01) and increases user accuracy in interpreting both robot intent and the required collaboration by 42%.
📝 Abstract
Clear communication of robot intent fosters transparency and interpretability in physical human-robot interaction (pHRI), particularly during assistive tasks involving direct human-robot contact. We introduce CoRI, a pipeline that automatically generates natural language communication of a robot's upcoming actions directly from its motion plan and visual perception. Our pipeline first processes the robot's image view to identify human poses and key environmental features. It then encodes the planned 3D spatial trajectory (including velocity and force) onto this view, visually grounding the path and its dynamics. CoRI queries a vision-language model with this visual representation to interpret the planned action within the visual context before generating concise, user-directed statements, without relying on task-specific information. Results from a user study involving robot-assisted feeding, bathing, and shaving tasks across two different robots indicate that CoRI yields a statistically significant improvement in communication clarity over a baseline communication strategy. Specifically, CoRI effectively conveys not only the robot's high-level intentions but also crucial details about its motion and any collaborative user action needed.
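The pipeline stages described above (summarizing the planned trajectory's dynamics and composing a VLM query around the annotated camera view) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class `TrajectoryPoint`, the functions `summarize_dynamics` and `build_vlm_query`, and the velocity/force thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrajectoryPoint:
    """One waypoint of the robot's planned motion (hypothetical schema)."""
    xyz: Tuple[float, float, float]  # 3D position in the robot base frame (m)
    velocity: float                  # planned speed at this point (m/s)
    force: float                     # planned contact force at this point (N)

def summarize_dynamics(traj: List[TrajectoryPoint]) -> str:
    """Reduce the trajectory to coarse, human-readable dynamics labels.
    Thresholds here are illustrative, not from the paper."""
    peak_v = max(p.velocity for p in traj)
    peak_f = max(p.force for p in traj)
    speed = "slow" if peak_v < 0.05 else "moderate" if peak_v < 0.2 else "fast"
    contact = "light contact" if peak_f < 5.0 else "firm contact"
    return f"{speed} motion with {contact} (peak {peak_v:.2f} m/s, {peak_f:.1f} N)"

def build_vlm_query(scene_context: str, dynamics: str) -> str:
    """Compose the text side of a VLM query; in the full pipeline the image
    side would carry the camera view with the projected trajectory overlay."""
    return (
        "The attached image shows the robot's camera view with its planned "
        f"path drawn on it. The motion is {dynamics}. {scene_context} "
        "In one or two sentences addressed to the user, state what the robot "
        "is about to do and any action the user should take."
    )

# Example: a short approach-and-contact trajectory.
traj = [
    TrajectoryPoint((0.30, 0.00, 0.40), velocity=0.03, force=1.0),
    TrajectoryPoint((0.35, 0.00, 0.35), velocity=0.04, force=6.5),
]
prompt = build_vlm_query("A person's face is visible near the path.",
                         summarize_dynamics(traj))
print(prompt)
```

The key design point this sketch mirrors is that the query carries no task-specific template: only the annotated view, coarse dynamics, and a generic instruction, leaving intent interpretation to the VLM.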