PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses delayed feedback and limited opportunity for intervention in long-horizon, tool-using tasks performed by large language model (LLM) agents. The authors propose PrefixGuard, a trace-to-monitor framework that combines trajectory abstraction with supervised monitoring to warn of failures early. An offline StepView induction step derives deterministic typed-step adapters from raw trace samples; PrefixGuard then trains a lightweight prefix-risk monitor on terminal outcomes, which provides interpretable intermediate signals during task execution. Its components are typed-step adapters, a supervised scorer, post-hoc deterministic finite automaton (DFA) extraction for finite-state audit, and an observability ceiling on AUPRC that separates monitor error from failures with no evidence in the observed prefix. On WebArena, τ²-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach AUPRC scores of 0.900, 0.710, 0.533, and 0.557 respectively. They improve over raw-text controls by an average of +0.137 AUPRC, and LLM judges remain substantially weaker under the same prefix-warning protocol.
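To make the idea of an online prefix monitor concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `Step` record standing in for a typed step and a toy scorer, `toy_prefix_risk`, standing in for the learned prefix-risk scorer; the `first_alert` helper and the threshold value are likewise illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """Hypothetical typed step: a coarse event kind plus an error flag."""
    kind: str        # e.g. "click", "tool_call", "shell"
    is_error: bool   # whether the environment reported an error at this step


def toy_prefix_risk(prefix: Sequence[Step]) -> float:
    """Stand-in scorer (assumption): fraction of error steps seen so far.

    A real monitor would be trained on terminal outcomes; this is only a
    placeholder so the alerting loop below can run.
    """
    if not prefix:
        return 0.0
    return sum(s.is_error for s in prefix) / len(prefix)


def first_alert(
    trace: Sequence[Step],
    scorer: Callable[[Sequence[Step]], float],
    threshold: float,
) -> Optional[int]:
    """Return the 1-based length of the first prefix whose risk reaches
    the threshold, or None if no prefix triggers an alert."""
    for t in range(1, len(trace) + 1):
        if scorer(trace[:t]) >= threshold:
            return t
    return None


trace = [
    Step("click", False),
    Step("type", False),
    Step("click", True),
    Step("click", True),
]
alert_at = first_alert(trace, toy_prefix_risk, threshold=0.5)
print(alert_at)
```

The loop scores each prefix as it grows, which mirrors the online setting: an alert is only useful if it fires before the final outcome check, so the step at which the threshold is first crossed matters as much as the ranking quality of the scores.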

📝 Abstract

Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, $τ^2$-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach an area under the precision-recall curve (AUPRC) of 0.900/0.710/0.533/0.557. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based AUPRC that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and $τ^2$-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas $τ^2$-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
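For readers unfamiliar with the headline metric, here is a small sketch of how AUPRC is commonly estimated from risk scores using the step-wise average-precision estimator. The abstract does not say which estimator the authors use, so treat this as an assumption; the function name, the example scores, and the convention that label 1 marks a failed run are all illustrative.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate AUPRC as average precision.

    Items are ranked by descending score; precision is evaluated at the
    rank of each positive item and averaged over all positives.
    Assumes label 1 = failure (positive class) and distinct scores.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Illustrative data: two failures (label 1) and two successes (label 0).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
print(round(ap, 4))
```

Because the metric depends only on how runs are ranked, a high value says nothing about whether any single threshold yields few false alarms, which is consistent with the abstract's point that strong ranking does not imply deployment utility.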

Problem

Research questions and friction points this paper is trying to address.

LLM agents

online failure warning

prefix monitoring

trace-based monitoring

actionable intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

PrefixGuard

LLM-agent monitoring

StepView induction