🤖 AI Summary
While large language model (LLM) agents possess explicit knowledge of risks, they frequently overlook or misjudge high-risk actions during execution, revealing a critical gap between risk awareness and safe operational behavior. Method: We propose a decoupled risk-verification framework comprising (1) a three-tier evaluation scheme that quantifies the gap between risk identification and safe action execution, and (2) two independent, modular components, a risk verifier and a trajectory abstractor, that disentangle risk assessment from action generation, enabling generation-verification co-execution and abstraction-enhanced reasoning. Contribution/Results: Our approach requires no model fine-tuning and is consistently effective across diverse LLMs, including reasoning-specialized models. It reduces the execution rate of high-risk operations by 55.3%, significantly improving robustness and controllability in safety-critical applications.
📝 Abstract
Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose diverse and severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents' risk awareness and their safe-execution abilities: while they often answer "Yes" to queries like "Is executing `sudo rm -rf /*` dangerous?", they frequently fail to identify such risks in instantiated trajectories, or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework that examines agents' safety across three progressive dimensions: 1) their knowledge of potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behavior in avoiding these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge ($>98\%$ pass rates), they fail to apply this knowledge when identifying risks in actual scenarios (performance drops by $>23\%$) and often still execute risky actions ($<26\%$ pass rates). Notably, this trend persists in more capable LMs as well as in specialized reasoning models like DeepSeek-R1, indicating that simply scaling model capability or inference compute does not inherently resolve these safety concerns. Instead, we exploit the observed gaps to develop a risk verifier that independently critiques the actions proposed by agents, together with an abstractor that converts specific execution trajectories into abstract descriptions in which LMs can more effectively identify risks. Our overall system reduces risky action execution by $55.3\%$ relative to vanilla-prompted agents.
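The verify-then-execute loop described above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the paper's implementation: the actual verifier and abstractor are LM-based, so the keyword check and all function names below (`abstract_trajectory`, `verify_action`, `step`) are hypothetical placeholders for LM calls.

```python
# Sketch of generation-verification co-execution with an abstractor:
# the agent proposes an action, the abstractor summarizes the trajectory,
# and an independent verifier critiques it before anything runs.
# All names are illustrative; a real system would query an LM at each step.

def abstract_trajectory(trajectory: list[str]) -> str:
    """Abstractor stand-in: convert a concrete execution trajectory
    into a short abstract description of the pending action."""
    return f"The agent intends to run: {trajectory[-1]}"

def verify_action(abstract_description: str) -> bool:
    """Risk-verifier stand-in: independently critique the proposed action.
    Here a keyword list substitutes for an LM judge."""
    risky_patterns = ["rm -rf /", "mkfs", "dd if="]
    return not any(p in abstract_description for p in risky_patterns)

def step(trajectory: list[str], proposed_action: str) -> str:
    """Co-execution loop: the generator proposes, the verifier vets
    the abstracted trajectory, and risky actions are blocked."""
    candidate = trajectory + [proposed_action]
    if verify_action(abstract_trajectory(candidate)):
        return "EXECUTE"
    return "BLOCK"

print(step(["ls ~/project"], "sudo rm -rf /*"))  # BLOCK
print(step(["ls ~/project"], "git status"))      # EXECUTE
```

The key design point is decoupling: the verifier never generates actions itself, so the generator-validator gap works in the system's favor, with the model critiquing abstracted trajectories rather than committing to executions.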