🤖 AI Summary
This study addresses the problem of predicting human perceptual judgments of robot social navigation behavior under few-shot conditions. We propose a few-shot prompting method based on large language models (LLMs) that encodes spatiotemporal trajectories of robots and humans into structured prompts and, for the first time, incorporates personalized in-context examples. Leveraging only a small number of real-world human-robot interaction annotations, roughly an order of magnitude fewer than conventional supervised learning requires, we achieve accurate inference of subjective evaluations on an extended version of the SEAN TOGETHER dataset. Experiments show that our method matches or surpasses traditional supervised models; prediction accuracy improves as the number of in-context examples grows, and personalized examples yield further gains. Ablation studies confirm the contribution of both spatiotemporal features and sensor-derived inputs. This work establishes a scalable, practical paradigm for human-robot perception modeling in low-data regimes.
📝 Abstract
Understanding how humans evaluate robot behavior during human-robot interactions is crucial for developing socially aware robots that behave according to human expectations. While the traditional approach to capturing these evaluations is to conduct a user study, recent work has proposed utilizing machine learning instead. However, existing data-driven methods require large amounts of labeled data, which limits their use in practice. To address this gap, we propose leveraging the few-shot learning capabilities of Large Language Models (LLMs) to improve how well a robot can predict a user's perception of its performance, and study this idea experimentally in social navigation tasks. To this end, we extend the SEAN TOGETHER dataset with additional real-world human-robot navigation episodes and participant feedback. Using this augmented dataset, we evaluate the ability of several LLMs to predict human perceptions of robot performance from a small number of in-context examples, based on observed spatio-temporal cues of the robot and surrounding human motion. Our results demonstrate that LLMs can match or exceed the performance of traditional supervised learning models while requiring an order of magnitude fewer labeled instances. We further show that prediction performance can improve with more in-context examples, confirming the scalability of our approach. Additionally, we investigate what kind of sensor-based information an LLM relies on to make these inferences by conducting an ablation study on the input features considered for performance prediction. Finally, we explore the novel application of personalized examples for in-context learning, i.e., examples drawn from the same user whose perceptions are being predicted, finding that they further enhance prediction accuracy. This work paves the way toward improving robot behavior in a scalable manner through user-centered feedback.
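For intuition, the sketch below illustrates how spatio-temporal trajectories and personalized in-context examples might be serialized into a few-shot prompt. It is not the authors' implementation; the field names, the 1-5 rating scale, and the prompt wording are illustrative assumptions, and the actual LLM API call is omitted.

```python
# Minimal sketch (assumed format, not the paper's actual encoding) of turning
# robot/human trajectories into a few-shot prompt for performance prediction.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Episode:
    """One navigation episode: robot and nearby-human positions over time."""
    robot_xy: List[Tuple[float, float]]   # robot (x, y) per timestep, in meters
    human_xy: List[Tuple[float, float]]   # closest human (x, y) per timestep
    rating: Optional[int] = None          # user's rating of robot performance (assumed 1-5)


def encode_episode(ep: Episode) -> str:
    """Serialize spatio-temporal cues into a compact textual description."""
    lines = []
    for t, (r, h) in enumerate(zip(ep.robot_xy, ep.human_xy)):
        dist = ((r[0] - h[0]) ** 2 + (r[1] - h[1]) ** 2) ** 0.5
        lines.append(f"t={t}: robot=({r[0]:.1f},{r[1]:.1f}) "
                     f"human=({h[0]:.1f},{h[1]:.1f}) dist={dist:.1f}m")
    return "\n".join(lines)


def build_prompt(examples: List[Episode], query: Episode) -> str:
    """Few-shot prompt: labeled in-context examples (personalized if drawn from
    the same user as the query) followed by the unlabeled query episode."""
    parts = ["Rate the robot's navigation performance from 1 (poor) to 5 (excellent)."]
    for i, ex in enumerate(examples, 1):
        parts.append(f"\nExample {i}:\n{encode_episode(ex)}\nRating: {ex.rating}")
    parts.append(f"\nNow rate this episode:\n{encode_episode(query)}\nRating:")
    return "\n".join(parts)


if __name__ == "__main__":
    labeled = [Episode(robot_xy=[(0.0, 0.0), (0.5, 0.0)],
                       human_xy=[(2.0, 0.0), (1.8, 0.0)],
                       rating=4)]
    unlabeled = Episode(robot_xy=[(0.0, 0.0), (0.3, 0.3)],
                        human_xy=[(1.0, 0.0), (0.9, 0.1)])
    # The resulting prompt would be sent to an LLM; the API call is omitted here.
    print(build_prompt(labeled, unlabeled))
```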