HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

📅 2026-04-15

🤖 AI Summary
This study addresses a gap in agent safety evaluation, which has focused mainly on externally induced risks while overlooking unsafe behavior that arises from intrinsic failures under benign, non-adversarial conditions. The authors propose “non-attack intrinsic risk auditing” and introduce HINTBench, a benchmark of 629 agent trajectories annotated under a unified five-constraint taxonomy that supports three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Experiments show that strong large language models perform well on trajectory-level risk detection but fall below 35 Strict-F1 on risk-step localization, and fine-grained failure diagnosis is harder still. Existing guard models also transfer poorly to this setting. The work frames intrinsic risk auditing in benign environments as an open challenge for agent safety research.

📝 Abstract
Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
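
The class balance stated in the abstract (523 risky, 106 safe) affects how trajectory-level detection scores should be read. The sketch below is illustrative only: `TrajectoryLabel` and its field names are hypothetical and are not HINTBench's actual schema, and the always-risky predictor is a reference baseline introduced here, not a system evaluated in the paper. It uses only the counts given in the abstract to show what a trivial detector would score.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TrajectoryLabel:
    """Hypothetical annotation record covering the three tasks.

    Field names are assumptions for illustration, not the benchmark's schema.
    """
    is_risky: bool                                        # task 1: risk detection
    risk_steps: List[int] = field(default_factory=list)   # task 2: risk-step localization
    failure_type: Optional[str] = None                    # task 3: failure-type identification


def binary_prf(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (risky) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Class counts taken from the abstract: 523 risky and 106 safe trajectories.
labels = [TrajectoryLabel(is_risky=True) for _ in range(523)]
labels += [TrajectoryLabel(is_risky=False) for _ in range(106)]

gold = [t.is_risky for t in labels]
pred = [True] * len(gold)  # trivial baseline: flag every trajectory as risky

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = binary_prf(gold, pred)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Under these counts the always-risky baseline reaches roughly 0.83 accuracy and 0.91 F1 on the risky class, so trajectory-level detection numbers are easiest to interpret next to such a baseline. The step-level and failure-type tasks have no comparable shortcut, which is consistent with the gap the abstract reports.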
Problem

Research questions and friction points this paper is trying to address.

intrinsic risk
agent safety
non-attack trajectory
risk auditing
long-horizon execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

intrinsic risk
agent safety
HINTBench
non-attack auditing
long-horizon trajectory