🤖 AI Summary
Existing LLM-based agents rely primarily on post-execution intervention for safety, lacking controllable supervision and proactive risk mitigation during multi-step planning. This work introduces a planning-level safety paradigm that identifies and blocks potential harms before task execution. Our contributions are threefold: (1) AuraGen, a data engine that synthesizes high-risk multi-step planning trajectories; (2) Safiron, a lightweight guardian model paired with a cross-planner adapter and trained in two stages; and (3) Pre-Exec Bench, a novel evaluation benchmark. Together, these close critical gaps in planning safety across data, modeling, and assessment. Integrating controllable synthesis, automated reward filtering, and fine-grained attribution, our approach outperforms baselines in human-validated complex scenarios, demonstrating high-precision risk detection, interpretable decision-making, and strong generalization.
📝 Abstract
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the model gap, we propose Safiron, a foundational guardrail combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
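To make the pre-execution flow concrete, here is a minimal sketch of how a planning-level guardrail slots between a planner and an executor. This is an illustration only: the `Verdict` schema mirrors the three outputs the abstract attributes to the guardian (a risk flag, a risk category, and a rationale), but the keyword rules below are a toy stand-in for the learned model, and all names (`Verdict`, `guard_plan`, `RISK_RULES`) are hypothetical, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical verdict schema mirroring the guardian's three outputs:
# a binary risk flag, a risk category, and a natural-language rationale.
@dataclass
class Verdict:
    risky: bool
    risk_type: Optional[str]
    rationale: Optional[str]

# Toy keyword rules standing in for the trained guardian model.
RISK_RULES = {
    "delete": "data_loss",
    "transfer": "financial",
    "sudo": "privilege_escalation",
}

def guard_plan(steps: list[str]) -> Verdict:
    """Screen a multi-step plan BEFORE any step is executed.

    Returns a blocking verdict on the first risky step found,
    otherwise a benign verdict that lets execution proceed.
    """
    for i, step in enumerate(steps):
        for keyword, risk_type in RISK_RULES.items():
            if keyword in step.lower():
                return Verdict(
                    risky=True,
                    risk_type=risk_type,
                    rationale=f"Step {i + 1} ({step!r}) matches "
                              f"risk category {risk_type!r}.",
                )
    return Verdict(risky=False, risk_type=None, rationale=None)

# Example: the third step would be blocked before execution.
plan = ["search for quarterly report", "open file", "delete all backups"]
print(guard_plan(plan).risk_type)  # → data_loss
```

The key design point the sketch captures is that the verdict is produced from the plan alone, so a harmful step is stopped before it can cause irreversible side effects, in contrast to post-execution guardrails.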