PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

📅 2025-05-02
🤖 AI Summary
Problem: Existing evaluation paradigms for interactive agents overemphasize task success rate while neglecting holistic user experience across the entire interaction process.

Method: This paper proposes the PIPA protocol, the first framework to model interactive task-planning agent behavior as a Partially Observable Markov Decision Process (POMDP). It decomposes the agent's behavioral chain into atomic components (context understanding, tool invocation, and response generation) and introduces multi-granularity, interpretable evaluation metrics.

Contribution/Results: By correlating intermediate behavioral steps with user satisfaction, PIPA uncovers uneven capability distributions across stages and empirically validates the substantial impact of intermediate behaviors on end-to-end user experience. It delivers actionable, fine-grained diagnostic insights to guide agent optimization and identifies concrete directions for advancing multi-agent coordination and user simulator design.

📝 Abstract
The growing capabilities of large language models (LLMs) in instruction following and context understanding have ushered in an era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a Partially Observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.

Problem

Research questions and friction points this paper is trying to address.

Evaluating task planning agents beyond just task completion
Assessing user satisfaction across the entire agentic process
Diagnosing agent strengths and weaknesses in decision-making

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation protocol for interactive planning agents
POMDP paradigm for agent behavior assessment
Atomic criteria for diagnosing the decision-making pipeline
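
To make the idea of stage-level diagnosis concrete, here is a minimal Python sketch. It is not the paper's implementation: the stage names are borrowed from the pipeline stages named in the summary, while the `Episode` structure, the [0, 1] scoring scale, the 1-5 satisfaction rating, and all numbers are hypothetical, invented purely for illustration. It averages per-stage scores to find the weakest stage and computes a Pearson correlation between each stage's scores and user satisfaction.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean

# Assumption: stage names mirror the pipeline stages named in the summary.
STAGES = ("context_understanding", "tool_invocation", "response_generation")


@dataclass
class Episode:
    """One interaction, scored per stage. Structure and scales are hypothetical."""
    stage_scores: dict        # stage name -> score in [0, 1]
    task_completed: bool      # end-to-end outcome
    user_satisfaction: float  # assumed 1-5 rating


def stage_means(episodes):
    """Average score per stage across all episodes."""
    return {s: mean(e.stage_scores[s] for e in episodes) for s in STAGES}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def satisfaction_correlations(episodes):
    """Correlate each stage's scores with the user satisfaction ratings."""
    sat = [e.user_satisfaction for e in episodes]
    return {s: pearson([e.stage_scores[s] for e in episodes], sat) for s in STAGES}


# Made-up data for illustration only; these are not results from the paper.
episodes = [
    Episode({"context_understanding": 0.9, "tool_invocation": 0.4,
             "response_generation": 0.8}, True, 3.0),
    Episode({"context_understanding": 0.8, "tool_invocation": 0.6,
             "response_generation": 0.7}, True, 4.0),
    Episode({"context_understanding": 0.7, "tool_invocation": 0.8,
             "response_generation": 0.9}, False, 5.0),
]

means = stage_means(episodes)
weakest_stage = min(means, key=means.get)
correlations = satisfaction_correlations(episodes)
print("weakest stage:", weakest_stage)
print("correlation with satisfaction:", correlations)
```

Keeping `task_completed` separate from the per-stage scores reflects the summary's point that task completion alone may not track user satisfaction. A real protocol would need far more episodes and a principled scoring rubric before any such correlation could be trusted.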