Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current LLM-driven web agents often plan poorly, which leads to limited exploration, omitted critical steps, and high sensitivity to task constraints. This work introduces PlanAhead, a static planner-executor framework that systematically evaluates how four natural language plan representations (sequential subgoals, narrative, pseudocode, and checklist) affect the performance of multimodal LLM agents. It automatically sorts WebArena tasks into three difficulty levels without human annotation and proposes two evaluation metrics that account for stochastic variability: Achievement Rate (AR) and Solved-Task Consistency (STC). Evaluated on the tasks categorized as hard, the results show that both the plan representation and the LLM that generates the plan significantly affect agent robustness and task success.

📝 Abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. We then systematically evaluate 4 plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.
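
To make the planner-executor setup concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "static" means the plan is generated once before execution and never revised. The names `PlanFormat`, `FORMAT_INSTRUCTIONS`, `build_planner_prompt`, and `run_episode`, the prompt wording, and the `"stop"` action are all hypothetical and chosen only for illustration. The planner and executor are passed in as plain callables so that any LLM client could be substituted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class PlanFormat(Enum):
    """The four plan representations named in the abstract."""
    SUBGOALS = "sequential subgoals"
    NARRATIVE = "narrative"
    PSEUDOCODE = "pseudocode"
    CHECKLIST = "checklist"


# Hypothetical formatting instructions; the paper's actual prompts are not given here.
FORMAT_INSTRUCTIONS = {
    PlanFormat.SUBGOALS: "Write the plan as a numbered sequence of subgoals.",
    PlanFormat.NARRATIVE: "Write the plan as a short prose description.",
    PlanFormat.PSEUDOCODE: "Write the plan as pseudocode.",
    PlanFormat.CHECKLIST: "Write the plan as a checklist of items to tick off.",
}


def build_planner_prompt(task: str, fmt: PlanFormat) -> str:
    """Combine the task description with the format-specific instruction."""
    return f"Task: {task}\n{FORMAT_INSTRUCTIONS[fmt]}"


@dataclass
class Episode:
    task: str
    fmt: PlanFormat
    plan: str
    actions: List[str]


def run_episode(
    task: str,
    fmt: PlanFormat,
    planner: Callable[[str], str],
    executor: Callable[[str, str, List[str]], str],
    max_steps: int = 5,
) -> Episode:
    """Generate one plan up front, then let the executor act with it."""
    # Assumed reading of "static": the planner is called exactly once.
    plan = planner(build_planner_prompt(task, fmt))
    actions: List[str] = []
    for _ in range(max_steps):
        # The executor sees the task, the fixed plan, and its own history.
        action = executor(task, plan, actions)
        actions.append(action)
        if action == "stop":  # hypothetical terminal action
            break
    return Episode(task, fmt, plan, actions)
```

Under these assumptions, comparing representations amounts to calling `run_episode` on the same task with each `PlanFormat` value and the same executor, so that only the plan's form changes between runs.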

Problem

Research questions and friction points this paper is trying to address.

LLM web agents

planning representations

task planning

agent robustness

plan formulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

plan representation

LLM web agents

PlanAhead framework