How Users Understand Robot Foundation Model Performance through Task Success Rates and Beyond

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges non-expert users face in accurately interpreting the performance and risks of Robot Foundation Models (RFMs) on unseen tasks, particularly due to misinterpretations of existing evaluation metrics. It presents the first systematic investigation into how non-experts utilize evaluation information—such as task success rates, descriptions of failure cases, and demonstration videos—to make risk judgments. Through a user study grounded in real RFM data, the work combines qualitative and quantitative analyses to reveal that while users can reasonably interpret success rates, they heavily rely on supplementary failure examples and strongly desire access to both historical evaluation data and real-time performance predictions for new tasks. These findings highlight insufficient transparency in current RFM evaluation frameworks and provide empirical grounding for designing more interpretable and user-friendly interaction mechanisms.

📝 Abstract
Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. In these cases, it is crucial that users understand the risks associated with attempting novel tasks due to the relatively high cost of failure. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases that are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.
Problem

Research questions and friction points this paper is trying to address.

Robot Foundation Models
task success rate
user understanding
performance evaluation
non-expert interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robot Foundation Models
Task Success Rate
User Understanding
Failure Case Analysis
Human-Robot Interaction
Isaac S. Sheidlower
Brown University, Providence, RI, USA
Jindan Huang
Tufts University, Medford, MA, USA
Human-Computer Interaction · Human-Robot Interaction · Human-Centered AI · RLHF
James Staley
Tufts University, Medford, MA, USA
Bingyu Wu
Tufts University, Medford, MA, USA
Qicong Chen
Tufts University, Medford, MA, USA
R. Aronson
Tufts University, Medford, MA, USA
Elaine Short
Tufts University, Medford, MA, USA