🤖 AI Summary
Existing approaches to generating social behaviors for robots rely heavily on predefined actions or human feedback, limiting their flexibility and autonomy. This work proposes CRISP, a novel framework that introduces a vision-language model (VLM) as an introspective “social critic,” enabling robots to autonomously evaluate and iteratively refine their behaviors. Requiring only the robot’s MJCF structural file—and no task-specific APIs or human intervention—CRISP generates natural, contextually appropriate motion sequences across diverse robotic platforms. By integrating context-driven planning, joint-motion visualization, and reward-based iterative search, CRISP significantly outperforms existing methods in user studies involving five robot types and twenty real-world scenarios, achieving higher scores in both behavioral preference and contextual appropriateness.
📝 Abstract
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework in which a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a "human-like social critic." CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. Because this approach is not tied to a specific robot API, it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, the proposed method achieved significantly higher preference and situational-appropriateness ratings than previous methods. This research presents a general framework that minimizes human intervention while expanding robots' autonomous interaction capabilities and cross-platform applicability.
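The five steps above can be sketched as a single critique-and-replan loop. The toy below is a minimal illustration based only on the abstract: every function name, the regex-based MJCF scan, and the scalar critic are assumptions for demonstration, not the authors' implementation (which uses a VLM as the critic rather than a hand-coded score).

```python
# Toy sketch of CRISP's critique-and-replan loop (steps 1-5 from the
# abstract). All names and the scalar critic are illustrative, not the
# authors' actual API.
import random
import re

def extract_joints(mjcf_text):
    # Step (1): extract movable joints from an MJCF-style description.
    # Real MJCF is XML with joint types and ranges; this toy only scans
    # for joint names.
    return re.findall(r'<joint name="([^"]+)"', mjcf_text)

def plan_behavior(joints):
    # Steps (2)+(3): a step-by-step plan as (joint, target-angle) pairs,
    # initialized at a neutral pose.
    return [(j, 0.0) for j in joints]

def critic_score(plan):
    # Step (4): stand-in for the VLM social critic. Here the "socially
    # appropriate" pose is assumed to be 0.5 rad on every joint.
    return -sum(abs(angle - 0.5) for _, angle in plan)

def replan(plan, rng):
    # Step (5): perturb one step of the plan, mimicking reward-based
    # search over candidate behaviors.
    i = rng.randrange(len(plan))
    joint, angle = plan[i]
    revised = list(plan)
    revised[i] = (joint, angle + rng.uniform(-0.2, 0.2))
    return revised

def crisp(mjcf_text, iterations=50, seed=0):
    rng = random.Random(seed)
    plan = plan_behavior(extract_joints(mjcf_text))
    score = critic_score(plan)
    for _ in range(iterations):
        candidate = replan(plan, rng)
        candidate_score = critic_score(candidate)
        if candidate_score > score:  # keep the critic-preferred behavior
            plan, score = candidate, candidate_score
    return plan, score
```

Running `crisp()` on a small MJCF snippet with two joints iteratively nudges the plan toward the critic's preferred pose; in the real framework the reward comes from the VLM's judgment of rendered joint-motion visualizations rather than a fixed target angle.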
Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/