Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing evaluations of proactivity in LLM-based agents lack systematic, cross-source, long-horizon benchmarks. Method: We propose PROBE, a formal framework that conceptualizes proactivity as a three-stage capability (problem discovery, bottleneck identification, and solution execution) and introduces a multi-stage evaluation pipeline supporting cross-context reasoning, long-term memory tracking, and action verification. Contribution/Results: PROBE is the first quantifiable, end-to-end benchmark for measuring agent proactivity. Empirical evaluation reveals that state-of-the-art models (e.g., GPT-5, Claude Opus-4.1) achieve only ~40% end-to-end success on realistic proactivity tasks, exposing fundamental deficiencies including goal drift, context forgetting, and execution fragmentation. PROBE establishes a reproducible, scalable evaluation paradigm and identifies concrete avenues for improvement, advancing rigorous, principled research on proactive LLM agents.

📝 Abstract
LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.
Problem

Research questions and friction points this paper is trying to address.

Measuring proactive problem-solving capabilities in LLM agents
Evaluating reasoning across multiple sources and time horizons
Assessing autonomous issue identification and resolution in AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

PROBE framework measures proactive problem solving
Decomposes proactivity into three core capabilities
Evaluates LLMs through bottleneck identification pipeline
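The three-stage decomposition implies that an episode succeeds only if every stage succeeds in order, which is why end-to-end scores sit well below per-stage scores. A minimal sketch of that gated scoring follows; the stage names mirror the summary above, but the data structures, scoring rules, and example episodes are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of a three-stage proactivity evaluation,
# mirroring the discovery -> bottleneck -> execution decomposition.
# Dataclasses and scoring rules here are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class StageResult:
    discovered_issue: bool      # stage 1: did the agent surface the unspecified issue?
    correct_bottleneck: bool    # stage 2: did it pinpoint the right bottleneck?
    resolution_verified: bool   # stage 3: did its action pass verification?


def end_to_end_success(r: StageResult) -> bool:
    # An episode counts only if all three stages succeed in order:
    # a correct fix applied to a misdiagnosed bottleneck still fails.
    return r.discovered_issue and r.correct_bottleneck and r.resolution_verified


def score(episodes: list[StageResult]) -> dict:
    # Per-stage rates plus the gated end-to-end rate.
    n = len(episodes)
    return {
        "discovery": sum(e.discovered_issue for e in episodes) / n,
        "bottleneck": sum(e.correct_bottleneck for e in episodes) / n,
        "execution": sum(e.resolution_verified for e in episodes) / n,
        "end_to_end": sum(end_to_end_success(e) for e in episodes) / n,
    }


episodes = [
    StageResult(True, True, True),    # full success
    StageResult(True, True, False),   # execution failure
    StageResult(True, False, False),  # wrong bottleneck identified
    StageResult(False, False, False), # issue never discovered
]
print(score(episodes))
```

Note how the end-to-end rate (1/4 here) is bounded by the weakest stage, which matches the pattern the summary reports: strong models can do well on individual stages yet still fall to ~40% overall.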
Gil Pasternak
Fastino.ai
Dheeraj Rajagopal
Research Scientist (Fastino AI, prev. Google DeepMind)
Artificial Intelligence, Information Extraction, Natural Language Processing
Julia White
Fastino.ai
Dhruv Atreja
Fastino.ai
Matthew Thomas
Fastino.ai
George Hurn-Maloney
Fastino.ai
Ash Lewis
Fastino.ai