🤖 AI Summary
Existing LLM-based agents rely primarily on post-execution intervention for safety, lacking controllable supervision and proactive risk mitigation during multi-step planning. This work introduces a planning-level safety paradigm that identifies and blocks potential harms before task execution. Our contributions are threefold: (1) AuraGen, a data engine that synthesizes high-risk multi-step planning trajectories; (2) Safiron, a lightweight guardian model paired with a cross-planner adapter and trained in two stages; and (3) Pre-Exec Bench, a novel evaluation benchmark. Together, these close critical gaps in planning safety across data, modeling, and assessment. Integrating controllable synthesis, automated reward filtering, and fine-grained attribution, our approach outperforms baselines in human-validated complex scenarios, demonstrating high-precision risk detection, interpretable decision-making, and strong generalization.
📝 Abstract
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the model gap, we propose Safiron, a foundational guardrail combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
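To make the pre-execution flow concrete, here is a minimal sketch of how a planning-level guardrail slots between a planner and an executor. This is an illustration only: the `Verdict` schema mirrors the three outputs the abstract attributes to the guardian (a risk flag, a risk category, and a rationale), but the keyword rules below are a toy stand-in for the learned model, and all names (`Verdict`, `guard_plan`, `RISK_RULES`) are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical verdict schema mirroring the guardian's three outputs:
# a binary risk flag, a risk category, and a natural-language rationale.
@dataclass
class Verdict:
    risky: bool
    risk_type: Optional[str]
    rationale: Optional[str]

# Toy keyword rules standing in for the trained guardian model.
RISK_RULES = {
    "delete": "data_loss",
    "transfer": "financial",
    "sudo": "privilege_escalation",
}

def guard_plan(steps: list[str]) -> Verdict:
    """Screen a multi-step plan BEFORE any step is executed.

    Returns a blocking verdict on the first risky step found,
    otherwise a benign verdict that lets execution proceed.
    """
    for i, step in enumerate(steps):
        for keyword, risk_type in RISK_RULES.items():
            if keyword in step.lower():
                return Verdict(
                    risky=True,
                    risk_type=risk_type,
                    rationale=f"Step {i + 1} ({step!r}) matches "
                              f"risk category {risk_type!r}.",
                )
    return Verdict(risky=False, risk_type=None, rationale=None)

# Example: the third step would be blocked before execution.
plan = ["search for quarterly report", "open file", "delete all backups"]
print(guard_plan(plan).risk_type)  # → data_loss
```

The key design point the sketch captures is that the verdict is produced from the plan alone, so a harmful step is stopped before it can cause irreversible side effects, in contrast to post-execution guardrails.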