TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

📅 2026-05-26
📈 Citations: 0
✨ Influential: 0
AI Summary
This work addresses latent safety risks that accumulate implicitly during the intermediate steps of multi-turn LLM agent interactions with tools and environments, and that traditional post-hoc auditing struggles to mitigate. To enable proactive intervention, the authors propose TRACES, a trajectory-state-based proactive safety auditing framework that uses hidden representations from an observer LLM to learn prefix-level risk states and model their temporal evolution in order to anticipate unsafe behavior. TRACES produces dense prefix-level risk estimates from weak trajectory-level supervision alone, avoiding reliance on fine-grained step-level annotations, and the authors' analyses suggest that these risk states can also help train safer agents. Experiments across multiple agent safety benchmarks show that TRACES improves both full-trajectory safety prediction and proactive risk discrimination, supporting the value of proactive auditing for long-horizon agent safety.
πŸ“ Abstract
LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.
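To make the idea of dense prefix-level estimates from weak trajectory-level supervision concrete, here is a minimal sketch. It is not the TRACES architecture: the running-average state, the linear scorer, the max-pooling aggregation, and every name in it (`prefix_risks`, `trajectory_loss`, `first_flag`, `decay`) are assumptions chosen for illustration. It scores each prefix of a trajectory from per-step vectors, which stand in for observer-LLM hidden representations, and computes a loss that needs only a single safe/unsafe label for the whole trajectory.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def prefix_risks(
    steps: Sequence[Sequence[float]],
    w: Sequence[float],
    b: float,
    decay: float = 0.5,
) -> List[float]:
    """Return one risk score in (0, 1) for every prefix of the trajectory.

    The running state is an exponential moving average of the step vectors,
    a simple stand-in (assumption) for a learned temporal model.
    """
    state = [0.0] * len(w)
    risks = []
    for h in steps:
        # Update the trajectory state with the newest step representation.
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, h)]
        # Linear scorer over the state (hypothetical parameters w, b).
        logit = sum(wi * si for wi, si in zip(w, state)) + b
        risks.append(sigmoid(logit))
    return risks


def trajectory_loss(risks: Sequence[float], label: int) -> float:
    """Binary cross-entropy against a trajectory-level label (1 = unsafe).

    Prefix risks are max-pooled into one trajectory score, so no step-level
    annotation is required. Max-pooling is an assumption of this sketch.
    """
    p = min(max(max(risks), 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def first_flag(risks: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the earliest prefix whose risk reaches the threshold, if any."""
    for t, r in enumerate(risks):
        if r >= threshold:
            return t
    return None


# Toy trajectory: two benign steps followed by one risky step.
steps = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
risks = prefix_risks(steps, w=[2.0, 0.0], b=-1.0)
print(risks, first_flag(risks))
```

In a trained system the scorer parameters would be fit by minimizing `trajectory_loss` over labeled trajectories, and `first_flag` marks the point at which an auditor could intervene before the trajectory completes.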
Problem

Research questions and friction points this paper is trying to address.

LLM agents
multi-turn interaction
safety auditing
proactive risk detection
trajectory safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive auditing
trajectory-state modeling
multi-turn LLM agents
latent risk representation
weak supervision