Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

📅 2026-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reliability gap between large language model (LLM) web agents and humans on long-horizon, real-world tasks, noting that existing evaluations, which focus on end-to-end success, struggle to pinpoint where failures arise. The authors propose a three-layer hierarchical planning framework (high-level planning, low-level execution, and adaptive replanning) and use it to diagnose failure mechanisms from a process-oriented perspective. Comparing formal PDDL plans with natural language plans under multi-level process evaluation, they identify low-level execution as the primary bottleneck and highlight the roles of perceptual grounding and adaptive replanning. Experiments show that PDDL plans yield more concise and goal-directed strategies, yet execution-level limitations remain the main performance constraint, which points to grounding and adaptive control as directions for improving agent reliability.

📝 Abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.
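To make the idea of process-based evaluation concrete, the following is a minimal Python sketch of how a failed episode might be attributed to one of the three layers named in the abstract. It is an illustration under stated assumptions, not the authors' evaluation protocol: the `Layer`, `Step`, `first_unrecovered_failure`, and `failure_breakdown` names are hypothetical, the rule "blame the earliest failure that is not immediately followed by a successful replanning step" is an assumption made for this example, and the traces are made-up toy data rather than results from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the abstract."""
    PLANNING = "high-level planning"
    EXECUTION = "low-level execution"
    REPLANNING = "replanning"


@dataclass
class Step:
    """One logged step of an episode (hypothetical trace format)."""
    layer: Layer
    ok: bool  # whether the step passed its layer-specific check


def first_unrecovered_failure(trace):
    """Return the layer of the earliest failure that was not recovered.

    Assumption for this sketch: a failed step counts as recovered only if
    the very next step is a successful replanning step.
    """
    for i, step in enumerate(trace):
        if step.ok:
            continue
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        recovered = (
            nxt is not None and nxt.layer is Layer.REPLANNING and nxt.ok
        )
        if not recovered:
            return step.layer
    return None  # no unrecovered failure in this episode


def failure_breakdown(traces):
    """Count, per layer, how many episodes first broke at that layer."""
    counts = Counter()
    for trace in traces:
        layer = first_unrecovered_failure(trace)
        if layer is not None:
            counts[layer] += 1
    return counts


# Toy traces for illustration only (not data from the paper).
traces = [
    [Step(Layer.PLANNING, True), Step(Layer.EXECUTION, False)],
    [Step(Layer.PLANNING, True), Step(Layer.EXECUTION, True)],
    [Step(Layer.PLANNING, False)],
    [Step(Layer.PLANNING, True), Step(Layer.EXECUTION, False),
     Step(Layer.REPLANNING, False)],
    [Step(Layer.PLANNING, True), Step(Layer.EXECUTION, False),
     Step(Layer.REPLANNING, True), Step(Layer.EXECUTION, True)],
]
breakdown = failure_breakdown(traces)
for layer in Layer:
    print(f"{layer.value}: {breakdown[layer]}")
```

Unlike a single end-to-end success rate, a per-layer count of this kind separates episodes that broke during planning from those that broke during execution or failed to recover. Other attribution rules (for example, blaming the replanning layer when recovery fails) are equally plausible and would change the counts.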

Problem

Research questions and friction points this paper is trying to address.

LLM-based web agents

hierarchical planning

failure analysis

long-horizon tasks

perceptual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical planning

LLM web agents

PDDL