Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in the evaluation of GUI-based agents: most evaluations emphasize task success rates and say little about fine-grained alignment with human behavior. The authors propose a trace-level evaluation framework for production search systems that compares agent and human behavior along three dimensions: task outcome and effort, query formulation, and navigation across interface states. In a controlled user study in which 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks, the agent achieves task success comparable to participants and issues broadly aligned queries, but its navigation strategy diverges systematically: participants follow content-centric, exploratory paths, whereas the agent is more search-centric and low-branching. The study shows that task success and query alignment do not imply behavioral fidelity, which motivates trace-level diagnostics when GUI agents stand in for users.

📝 Abstract

LLM-driven GUI agents are increasingly used in production systems to automate workflows and simulate users for evaluation and optimization. Yet most GUI-agent evaluations emphasize task success and provide limited evidence on whether agents interact in human-like ways. We present a trace-level evaluation framework that compares human and agent behavior across (i) task outcome and effort, (ii) query formulation, and (iii) navigation across interface states. We instantiate the framework in a controlled study in a production audio-streaming search application, where 39 participants and a state-of-the-art GUI agent perform ten multi-hop search tasks. The agent achieves task success comparable to participants and generates broadly aligned queries, but follows systematically different navigation strategies: participants exhibit content-centric, exploratory behavior, while the agent is more search-centric and low-branching. These results show that outcome and query alignment do not imply behavioral alignment, motivating trace-level diagnostics when deploying GUI agents as proxies for users in production search systems.
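To make the idea of comparing navigation across interface states concrete, here is a minimal sketch of one way such a comparison could be set up. It is not the authors' method: the state names, the toy traces, the choice of Jensen-Shannon divergence over transition distributions, and the `mean_branching` statistic are all illustrative assumptions.

```python
from collections import Counter
from math import log2


def transition_counts(traces):
    """Count state-to-state transitions over a list of traces."""
    counts = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[(src, dst)] += 1
    return counts


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def mean_branching(traces):
    """Average number of distinct next states per visited source state."""
    successors = {}
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            successors.setdefault(src, set()).add(dst)
    if not successors:
        return 0.0
    return sum(len(s) for s in successors.values()) / len(successors)


# Hypothetical traces: each is a sequence of interface states for one task.
human_traces = [
    ["home", "search", "results", "album", "artist", "album", "track"],
    ["home", "search", "results", "artist", "playlist", "track"],
]
agent_traces = [
    ["home", "search", "results", "search", "results", "track"],
    ["home", "search", "results", "track"],
]

p = normalize(transition_counts(human_traces))
q = normalize(transition_counts(agent_traces))
divergence = js_divergence(p, q)          # 0 means identical, 1 means disjoint
human_branching = mean_branching(human_traces)
agent_branching = mean_branching(agent_traces)

print(f"JS divergence over transitions: {divergence:.3f}")
print(f"Mean branching, human vs agent: {human_branching:.2f} vs {agent_branching:.2f}")
```

In this toy data both groups reach a track page, yet the transition distributions differ and the human traces fan out over more successor states, which is the kind of gap that an outcome-only metric would not surface.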

Problem

Research questions and friction points this paper is trying to address.

GUI-agent

human-like behavior

trace-level evaluation

production search systems

behavioral alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

trace-level evaluation

GUI agents

human-agent behavior comparison