🤖 AI Summary
This study addresses the inefficiency and inaccuracy of human error identification when supervising the behavior traces of intelligent agents, problems often caused by information overload or insufficient detail. Through three user studies, the authors systematically evaluate the verification utility of basic action traces, explore three alternative designs via design probes, and propose a novel interactive interface. The interface significantly reduces the time users need to locate errors and increases their confidence in their decisions, though it does not meaningfully improve final judgment accuracy. The findings surface critical challenges in human–AI collaborative verification, including users' implicit assumptions about agent behavior and their subjective, shifting criteria for correctness, offering empirical insights to inform future designs of explainable AI systems and human–agent collaboration.
📝 Abstract
To enable human oversight, agentic AI systems often provide a trace of their reasoning and action steps. Designing traces with an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer User Agent, we investigate the utility of basic action traces for verification, explore three alternatives via design probes, and test a novel interface's impact on error finding in question-answering tasks. As expected, we find that current practices are cumbersome, limiting their efficacy. In contrast, our proposed design reduced the time participants spent finding errors. However, although participants reported higher confidence in their decisions, their final accuracy was not meaningfully improved. Taken together, our studies surface challenges for human verification of agentic systems, including managing built-in assumptions, users' subjective and changing correctness criteria, and the shortcomings, yet importance, of communicating the agent's process.