AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

📅 2026-01-26
🤖 AI Summary
This work addresses the complex safety risks faced by AI agents in autonomous tool use and environmental interaction, where existing safeguards lack systematic risk characterization and interpretable diagnostics. To bridge this gap, we propose the first three-dimensional orthogonal risk taxonomy for AI agents—spanning risk sources, failure modes, and consequences—and leverage it to develop ATBench, a fine-grained safety benchmark, along with AgentDoG, a diagnostic defense framework enabling trajectory-level, context-aware monitoring and root-cause tracing. Experimental results demonstrate that our approach significantly outperforms current safety auditing methods across diverse interactive scenarios, overcoming the limitations of conventional binary safety labels. The code, models (Qwen/Llama series, 4B–8B), and dataset are publicly released.

📝 Abstract
The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interaction. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To build an agentic guardrail that covers the numerous and complex risky behaviors of agents, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured, hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained, contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and of seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation across diverse and complex interactive scenarios. All models and datasets are openly released.
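To make the three-axis taxonomy concrete, here is a minimal sketch of what a per-step diagnostic verdict could look like under it. This is an illustration only: the enum members and the `DiagnosticVerdict` structure are hypothetical placeholders, not the paper's actual category names or AgentDoG's output format; the sketch only shows how a source/failure-mode/consequence triple is richer than a binary safe/unsafe label.

```python
# Hypothetical sketch of a three-axis agentic risk label, following the
# source (where) / failure mode (how) / consequence (what) structure
# described in the abstract. All member names are illustrative, not the
# paper's actual taxonomy values.
from dataclasses import dataclass
from enum import Enum


class RiskSource(Enum):  # "where" the risk originates
    USER_INSTRUCTION = "user_instruction"
    TOOL_OUTPUT = "tool_output"
    ENVIRONMENT = "environment"


class FailureMode(Enum):  # "how" the agent fails
    UNSAFE_ACTION = "unsafe_action"
    UNREASONABLE_ACTION = "unreasonable_action"  # seemingly safe but unjustified


class Consequence(Enum):  # "what" harm results
    DATA_LEAK = "data_leak"
    FINANCIAL_LOSS = "financial_loss"
    SYSTEM_DAMAGE = "system_damage"


@dataclass
class DiagnosticVerdict:
    """A trajectory-step verdict richer than a binary safe/unsafe label."""
    step_index: int          # which step in the agent trajectory is flagged
    source: RiskSource
    mode: FailureMode
    consequence: Consequence
    rationale: str           # root-cause explanation, usable for alignment feedback


verdict = DiagnosticVerdict(
    step_index=3,
    source=RiskSource.TOOL_OUTPUT,
    mode=FailureMode.UNSAFE_ACTION,
    consequence=Consequence.DATA_LEAK,
    rationale="Agent forwarded credentials returned by a tool to an external URL.",
)
print(verdict.source.value, verdict.mode.value, verdict.consequence.value)
```

A binary guardrail would collapse this verdict to a single "unsafe" flag; keeping the three axes separate is what allows the root cause (here, an injected tool output) to be traced and fixed.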
Problem

Research questions and friction points this paper is trying to address.

AI agent safety
security risks
risk diagnosis
guardrail framework
autonomous tool use
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnostic Guardrail
Agentic Safety
Three-dimensional Risk Taxonomy
ATBench
Agent Alignment
Authors

Dongrui Liu, Shanghai Artificial Intelligence Laboratory
Qihan Ren, Shanghai Jiao Tong University (Explainable AI, Machine Learning, Computer Vision, Natural Language Processing)
Chen Qian, Shanghai Artificial Intelligence Laboratory
Shuai Shao, Shanghai Artificial Intelligence Laboratory
Yuejin Xie, Huazhong University of Science and Technology (LLM Safety, Trustworthy AI)
Yu Li, Shanghai Artificial Intelligence Laboratory
Zhonghao Yang, Shanghai Artificial Intelligence Laboratory
Haoyu Luo, Xi'an Jiaotong University (Computer Vision, Pattern Recognition, Continual Learning)
Peng Wang, Shanghai Artificial Intelligence Laboratory
Qingyu Liu, Electronic and Computer Engineering, Peking University (wireless networking, mobile networking, internet of things, intelligent transportation)
Binxin Hu, Shanghai Artificial Intelligence Laboratory
Ling Tang, Shanghai Artificial Intelligence Laboratory
Jilin Mei, Research Center for Intelligent Computing Systems, Institute of Computing Technology, University of Chinese Academy of Sciences (autonomous driving)
Dadi Guo, Shanghai Artificial Intelligence Laboratory
Lei Yuan, Shanghai Artificial Intelligence Laboratory
Junyao Yang, Shanghai Artificial Intelligence Laboratory
Guanxu Chen, Shanghai Jiao Tong University (Trustworthy AI, Interpretability)
Qihao Lin, Shanghai Artificial Intelligence Laboratory
Yi Yu, Shanghai Artificial Intelligence Laboratory
Bo Zhang, Shanghai Artificial Intelligence Laboratory
Jiaxuan Guo, Shanghai Artificial Intelligence Laboratory
Jie Zhang, unknown affiliation
Wenqi Shao, Shanghai AI Laboratory (Foundation Model Evaluation, LLM Compression, Efficient Adaptation, Multimodal Learning)
Huiqi Deng, Shanghai Artificial Intelligence Laboratory
Zhiheng Xi, Fudan University (LLM Reasoning, LLM-based Agents)
Wenjie Wang, Shanghai Artificial Intelligence Laboratory
Wenxuan Wang, Institute of Automation, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence (Vision Language Model, Computer Vision, Medical Image Analysis)
Wen Shen, Shanghai Artificial Intelligence Laboratory
Zhikai Chen, Shanghai Artificial Intelligence Laboratory
Haoyu Xie, Shanghai Artificial Intelligence Laboratory
Jialing Tao, Alibaba
Juntao Dai, Shanghai Artificial Intelligence Laboratory
Jiaming Ji, Shanghai Artificial Intelligence Laboratory
Zhongjie Ba, Zhejiang University (IoT security)
Linfeng Zhang, DP Technology; AI for Science Institute (AI for Science, multi-scale modeling, molecular simulation, drug/materials design)
Yong Liu, School of Artificial Intelligence, Beijing University of Posts and Telecommunications (Medical Image Analysis, Brain Network, NeuroImaging, Alzheimer's Disease)
Quanshi Zhang, Shanghai Jiao Tong University (Interpretable Machine Learning)
Lei Zhu, Shanghai Artificial Intelligence Laboratory
Zhihua Wei, Shanghai Artificial Intelligence Laboratory
Hui Xue, Shanghai Artificial Intelligence Laboratory
Chaochao Lu, Shanghai AI Laboratory (Causal AI)
Jing Shao, Shanghai AI Laboratory / Shanghai Jiao Tong University (Computer Vision, Multi-Modal Large Language Model)
Xia Hu, Google DeepMind (Deep Learning, Machine Learning, Multimodal)