ProactBench: Beyond What The User Asked For

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large language models focus predominantly on explicit instruction following and overlook a model's capacity to identify and respond to users' implicit needs. This work introduces "conversational proactivity" and formalizes it as three capabilities: Emergent, Critical, and Recovery. To measure it, the authors build ProactBench, a benchmark with a three-agent architecture (a Planner, a User Agent, and an Assistant Model) operating under information asymmetry, which is intended to guard against style-confounded scoring and rubric leakage. The corpus covers 24 communication styles drawn from a psychometric inventory, includes annotated trigger points, and is audited by an independent LLM judge. Across the evaluated models, Recovery tasks are both difficult and only weakly predicted by six standard benchmarks, which supports treating conversational proactivity as a separate evaluation axis for LLMs.

📝 Abstract

Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.
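To make the three-agent setup concrete, the following is a minimal Python sketch of how trigger points and information asymmetry might fit together. It is not the authors' implementation: the names `Scenario`, `TriggerPoint`, `run_dialogue`, the keyword-matching judge, and the exact split of who sees what are all assumptions made for illustration. The only details taken from the abstract are the three proactivity types and the idea that the Assistant Model never sees the rubric or the communication style.

```python
from dataclasses import dataclass
from enum import Enum


class ProactivityType(Enum):
    EMERGENT = "emergent"   # inference from a single disclosed anchor
    CRITICAL = "critical"   # synthesis across multiple anchors
    RECOVERY = "recovery"   # forward-looking value after task completion


@dataclass
class TriggerPoint:
    turn: int               # index of the user turn that opens the opportunity
    kind: ProactivityType
    rubric: str             # hypothetical: what a proactive reply should cover


@dataclass
class Scenario:
    style: str              # communication style (assumed hidden from the assistant)
    user_turns: list[str]   # utterances a user agent would deliver
    triggers: list[TriggerPoint]


def run_dialogue(scenario, assistant, judge):
    """Play one scripted dialogue and return the hit rate per proactivity type.

    Assumed asymmetry: the assistant receives only the dialogue history,
    and the judge receives only the rubric and the reply (never the style).
    """
    history = []
    hits = {kind: [] for kind in ProactivityType}
    triggers_by_turn = {t.turn: t for t in scenario.triggers}

    for turn, utterance in enumerate(scenario.user_turns):
        history.append(("user", utterance))
        reply = assistant(list(history))      # pass a copy: no hidden fields leak
        history.append(("assistant", reply))

        trigger = triggers_by_turn.get(turn)
        if trigger is not None:
            hits[trigger.kind].append(bool(judge(trigger.rubric, reply)))

    # Report only the types that actually had trigger points in this scenario.
    return {k.value: sum(v) / len(v) for k, v in hits.items() if v}


# Toy stand-ins for the assistant and the judge (both hypothetical).
def toy_assistant(history):
    last_user_message = history[-1][1]
    if "11pm" in last_user_message:
        return "Booked. I also confirmed late check-in is available."
    return "You're welcome."


def keyword_judge(rubric, reply):
    return rubric in reply.lower()


scenario = Scenario(
    style="terse",
    user_turns=[
        "My flight lands at 11pm. Book me a hotel near the airport.",
        "Thanks, that's all.",
    ],
    triggers=[
        TriggerPoint(turn=0, kind=ProactivityType.EMERGENT, rubric="late check-in"),
        TriggerPoint(turn=1, kind=ProactivityType.RECOVERY, rubric="airport transfer"),
    ],
)

scores = run_dialogue(scenario, toy_assistant, keyword_judge)
print(scores)
```

In this toy run the assistant picks up the single-anchor cue (a late arrival) but offers nothing forward-looking once the task is closed, so it scores on the Emergent trigger and misses the Recovery one. A real judge would be an LLM rather than a keyword match; the keyword check is used here only to keep the sketch deterministic.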

Problem

Research questions and friction points this paper is trying to address.

conversational proactivity

implicit user needs

LLM benchmarking

proactive dialogue

unstated intentions

Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational proactivity

ProactBench

information asymmetry