TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning

📅 2026-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing uncertainty estimation methods struggle to capture trajectory-level failure signals that arise from sparse critical events (such as loops, incoherent tool usage, or user-agent miscoordination) in multi-turn human-agent tool interactions. This work proposes TRACER, a trajectory-level uncertainty metric for dual-control interactions between tool-augmented agents and users. TRACER scores each step by combining content-aware surprisal, situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps. It then aggregates the step scores with a tail-focused risk functional over a MAX-composite step risk, so that interaction segments that are locally confident yet globally failing stand out. On τ²-bench, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines.
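The two reported metrics can be illustrated with a minimal sketch. It assumes one scalar risk score and one binary failure label per trajectory. The function names and toy data are hypothetical, and the AUARC variant shown (mean success rate over every retained prefix when trajectories are sorted by increasing risk) is one common definition that may differ from the paper's exact protocol.

```python
from typing import List, Sequence


def auroc(risk: Sequence[float], failed: Sequence[bool]) -> float:
    """Probability that a failed trajectory gets a higher risk score than a
    successful one (ties count as half), i.e. the rank form of AUROC."""
    pos = [r for r, f in zip(risk, failed) if f]
    neg = [r for r, f in zip(risk, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful trajectory")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auarc(risk: Sequence[float], failed: Sequence[bool]) -> float:
    """Area under the accuracy-rejection curve (assumed variant): keep the k
    lowest-risk trajectories for k = 1..N and average their success rates."""
    if not risk:
        raise ValueError("no trajectories")
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    successes = 0
    accuracies: List[float] = []
    for kept, i in enumerate(order, start=1):
        successes += 0 if failed[i] else 1
        accuracies.append(successes / kept)
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: four trajectories, two of which failed.
toy_risk = [0.1, 0.4, 0.35, 0.8]
toy_failed = [False, False, True, True]
toy_auroc = auroc(toy_risk, toy_failed)
toy_auarc = auarc(toy_risk, toy_failed)
```

A score that ranks failures above successes raises both numbers: AUROC measures failure prediction directly, while AUARC measures how much task success improves when the riskiest trajectories are declined.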

📝 Abstract

Estimating uncertainty for AI agents in real-world multi-turn tool-using interaction with humans is difficult because failures are often triggered by sparse critical episodes (e.g., looping, incoherent tool use, or user-agent miscoordination) even when local generation appears confident. Existing uncertainty proxies focus on single-shot text generation and therefore miss these trajectory-level breakdown signals. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction. TRACER combines content-aware surprisal with situational-awareness signals, semantic and lexical repetition, and tool-grounded coherence gaps, and aggregates them using a tail-focused risk functional with a MAX-composite step risk to surface decisive anomalies. We evaluate TRACER on $\tau^2$-bench by predicting task failure and selective task execution. To this end, TRACER improves AUROC by up to 37.1% and AUARC by up to 55% over baselines, enabling earlier and more accurate detection of uncertainty in complex conversational tool-use settings. Our code and benchmark are available at https://github.com/sinatayebati/agent-tracer.
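The aggregation described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the signal names, their normalization to [0, 1], the CVaR-style tail mean, and the `alpha` parameter are all hypothetical choices made here to show why a MAX-composite step risk combined with a tail-focused functional surfaces a single decisive anomaly that a plain average would dilute.

```python
import math
from typing import Dict, List, Sequence


def step_risk(signals: Dict[str, float]) -> float:
    """MAX-composite (assumed form): a step is as risky as its most
    alarming signal, so one strong anomaly is not averaged away."""
    return max(signals.values())


def tail_risk(step_risks: Sequence[float], alpha: float = 0.8) -> float:
    """Tail-focused functional (assumed CVaR-style): mean of the worst
    (1 - alpha) share of steps, with at least one step included."""
    if not step_risks:
        raise ValueError("trajectory has no steps")
    ordered = sorted(step_risks, reverse=True)
    # round() guards against float noise such as 3.0000000000000004
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


# Hypothetical five-step trajectory; step 3 contains a repetition loop.
trajectory: List[Dict[str, float]] = [
    {"surprisal": 0.1, "situational": 0.2, "repetition": 0.0, "coherence_gap": 0.1},
    {"surprisal": 0.2, "situational": 0.1, "repetition": 0.1, "coherence_gap": 0.0},
    {"surprisal": 0.1, "situational": 0.1, "repetition": 0.9, "coherence_gap": 0.2},
    {"surprisal": 0.3, "situational": 0.2, "repetition": 0.1, "coherence_gap": 0.1},
    {"surprisal": 0.2, "situational": 0.1, "repetition": 0.0, "coherence_gap": 0.1},
]

risks = [step_risk(s) for s in trajectory]
mean_score = sum(risks) / len(risks)   # dilutes the loop across five steps
tail_score = tail_risk(risks, alpha=0.8)  # keeps only the worst step here
```

In this toy trajectory every step has low surprisal, so the local generation looks confident, yet the tail score is driven entirely by the looping step while the mean stays low.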

Problem

Research questions and friction points this paper is trying to address.

uncertainty estimation

trajectory-level failure

tool-using agents

critical episodes

human-agent interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-level uncertainty

critical episodes

tool-agent interaction