🤖 AI Summary
Problem: Evaluating autonomous humanoid robot behavior in human-robot interaction remains challenging, particularly when quantifying motion quality. Conventional metrics such as task success rate are difficult to reproduce and fail to capture the complexity of robot movement trajectories.
Method: This paper proposes a trajectory-performance-based evaluation framework centered on the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator that compares robot control policies without requiring a human in the loop. The framework is validated on the ergoCub humanoid robot using teleoperation data, comparing imitation learning methods tailored to that platform.
Contribution/Results: Experiments indicate that NeME's assessments align more closely with the success rate measured on the robot than baseline methods do. The framework offers a reproducible, systematic, and insightful way to compare multimodal imitation learning approaches.
📝 Abstract
Evaluating and comparing the performance of autonomous humanoid robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, which are critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRIC tasks.
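
The meta-evaluation idea described above (score a policy by whether a trajectory classifier recognizes the action it was asked to perform) can be illustrated with a minimal sketch. This is not the authors' NeME model: a nearest-centroid classifier stands in for the deep network, the single-joint trajectories are synthetic, and the action names, policy names (`policy_a`, `policy_b`), and noise levels are all hypothetical.

```python
import math
import random


def make_trajectory(action, noise, rng, steps=20):
    """Synthetic single-joint trajectory: 'wave' oscillates, 'reach' ramps up."""
    traj = []
    for t in range(steps):
        x = t / (steps - 1)
        base = math.sin(2 * math.pi * x) if action == "wave" else x
        traj.append(base + rng.gauss(0.0, noise))
    return traj


def fit_centroids(demos):
    """Stand-in evaluator (assumption): one mean trajectory per action label."""
    centroids = {}
    for action, trajs in demos.items():
        steps = len(trajs[0])
        centroids[action] = [
            sum(tr[i] for tr in trajs) / len(trajs) for i in range(steps)
        ]
    return centroids


def classify(centroids, traj):
    """Return the action whose centroid is closest in squared distance."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda action: sq_dist(centroids[action], traj))


def evaluator_score(centroids, rollouts):
    """Fraction of rollouts recognized as the action the policy was asked for."""
    hits = sum(classify(centroids, traj) == intended for intended, traj in rollouts)
    return hits / len(rollouts)


rng = random.Random(0)
actions = ["wave", "reach"]

# "Teleoperation" demonstrations used to fit the stand-in evaluator.
demos = {a: [make_trajectory(a, 0.1, rng) for _ in range(30)] for a in actions}
centroids = fit_centroids(demos)

# Hypothetical policies, modeled only by how noisy their trajectories are.
policy_noise = {"policy_a": 0.05, "policy_b": 3.0}
scores = {}
for name, noise in policy_noise.items():
    rollouts = [
        (a, make_trajectory(a, noise, rng)) for a in actions for _ in range(50)
    ]
    scores[name] = evaluator_score(centroids, rollouts)

# Rank policies from best to worst according to the evaluator.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The evaluator is fit once on demonstration data and then reused for every policy, so the comparison needs no human judgment at evaluation time. Checking how well such scores align with success rates measured on the robot would require real rollouts, which this sketch does not attempt.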