How can we assess human-agent interactions? Case studies in software agent design

📅 2025-10-10

🤖 AI Summary

Existing evaluation benchmarks for LLM agents mostly assume full automation and so fail to capture realistic human-AI collaboration. Method: We propose PULSE, an evaluation framework designed for human-agent collaboration. It collects explicit user feedback, trains a model to predict user satisfaction, and combines human satisfaction ratings with model-generated pseudo-labels. The framework is deployed on a large-scale web platform built around the open-source software agent OpenHands, gathering in-the-wild usage data from over 15k users. Contribution/Results: Case studies examine how three design decisions (LLM backbone, planning strategy, and memory mechanisms) affect developer satisfaction rates. In-the-wild results diverge substantially from benchmark performance; for example, comparisons between claude-sonnet-4 and gpt-5 are anti-correlated across the two settings. PULSE also reduces confidence intervals by 40% compared to a standard A/B test, supporting more robust, user-centered conclusions about software agent design.

📝 Abstract

LLM-powered agents are both a promising new technology and a source of complexity, where choices about models, tools, and prompting can affect their usefulness. While numerous benchmarks measure agent accuracy across domains, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy the framework on a large-scale web platform built around the open-source software agent OpenHands, collecting in-the-wild usage data across over 15k users. We conduct case studies around how three agent design decisions -- choice of LLM backbone, planning strategy, and memory mechanisms -- impact developer satisfaction rates, yielding practical insights for software agent design. We also show how our framework can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between results comparing claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our findings provide guidance for evaluations of LLM agents with humans and identify opportunities for better agent designs.
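The abstract does not spell out how human ratings and pseudo-labels are combined, so the following is only a minimal sketch of one plausible approach: a prediction-powered-style estimator that averages model predictions over unrated sessions and corrects their bias using the sessions that do have human ratings. All function names, sample sizes, and the synthetic data generator are hypothetical and are not taken from the paper or from OpenHands.

```python
import math
import random
from statistics import fmean, variance

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def human_only_estimate(ratings):
    """Mean satisfaction and 95% CI half-width from human ratings alone."""
    n = len(ratings)
    return fmean(ratings), Z95 * math.sqrt(variance(ratings) / n)


def combined_estimate(ratings, preds_rated, preds_unrated):
    """Pseudo-label mean over unrated sessions, bias-corrected on rated ones."""
    n, big_n = len(ratings), len(preds_unrated)
    # Residuals measure how far the predictor is from the human ratings.
    residuals = [y - p for y, p in zip(ratings, preds_rated)]
    estimate = fmean(preds_unrated) + fmean(residuals)
    var = variance(preds_unrated) / big_n + variance(residuals) / n
    return estimate, Z95 * math.sqrt(var)


def simulate(n_rated=300, n_unrated=20000, seed=0):
    """Synthetic data only: a latent quality score drives ratings and predictions."""
    rng = random.Random(seed)

    def session():
        quality = rng.random()                       # assumed latent quality in [0, 1)
        rating = 1.0 if rng.random() < quality else 0.0  # binary human rating
        pred = min(1.0, max(0.0, quality + rng.gauss(0, 0.1)))  # noisy model prediction
        return rating, pred

    rated = [session() for _ in range(n_rated)]
    preds_unrated = [session()[1] for _ in range(n_unrated)]
    ratings = [r for r, _ in rated]
    preds_rated = [p for _, p in rated]
    return ratings, preds_rated, preds_unrated


if __name__ == "__main__":
    ratings, preds_rated, preds_unrated = simulate()
    print("human only:", human_only_estimate(ratings))
    print("combined:  ", combined_estimate(ratings, preds_rated, preds_unrated))
```

In this sketch the interval narrows only to the extent that the predictor explains variance in the human ratings; a weak predictor yields little or no gain. The size of the reduction on synthetic data says nothing about the 40% figure reported in the abstract.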

Problem

Research questions and friction points this paper is trying to address.

Evaluating human-agent interaction in collaborative software systems

Assessing the impact of agent design choices on user satisfaction rates

Addressing limitations of automated benchmarks for real-world usage

Innovation

Methods, ideas, or system contributions that make the work stand out.

PULSE framework combines human feedback with ML predictions

Large-scale web platform collects real-world user interaction data

Case studies analyze the impact of LLM backbone, planning strategy, and memory mechanisms on developer satisfaction
