🤖 AI Summary
AI agents deployed in the real world face dynamic, context-dependent safety risks, yet guardrails struggle to adapt after deployment because the available feedback is limited to sparse, noisy user reports. This work proposes LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that uses structured memory to turn occasional failures into reusable policy abstractions and adds conflict-aware local rules to prevent overgeneralization. Evidence-aware confidence gating, based on a posterior lower bound, ties memory reuse to accumulated evidence rather than empirical accuracy alone, so adaptation stays robust under limited feedback. Experiments on PrivacyLens+, ConFaide+, and AgentHarm show that LiSA consistently outperforms strong memory-based baselines, remains robust at 20% label-flip rates, and improves the latency-performance trade-off beyond what scaling the backbone model achieves.
📝 Abstract
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
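The idea of gating memory reuse on a posterior lower bound, rather than on empirical accuracy alone, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the bound, so the code assumes a uniform Beta(1, 1) prior over a rule's accuracy and a normal approximation to the resulting Beta posterior. The names `MemoryRule`, `posterior_lower_bound`, and `should_reuse`, and the threshold of 0.7, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class MemoryRule:
    """Hypothetical memory entry with counts of confirmed and contradicted uses."""
    name: str
    successes: int = 0
    failures: int = 0


def posterior_lower_bound(successes: int, failures: int, confidence: float = 0.95) -> float:
    """Approximate one-sided lower credible bound on a rule's accuracy.

    Assumption: uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + failures); the bound uses a normal
    approximation (posterior mean minus z posterior standard deviations).
    """
    a = 1 + successes
    b = 1 + failures
    n = a + b
    mean = a / n
    var = (a * b) / (n * n * (n + 1))
    z = NormalDist().inv_cdf(confidence)
    return max(0.0, mean - z * sqrt(var))


def should_reuse(rule: MemoryRule, threshold: float = 0.7, confidence: float = 0.95) -> bool:
    """Reuse a rule only when its lower bound, not its raw accuracy, clears the threshold."""
    return posterior_lower_bound(rule.successes, rule.failures, confidence) >= threshold


# Both rules have 100% empirical accuracy, but only one has enough evidence.
thin = MemoryRule("seen three times", successes=3, failures=0)
solid = MemoryRule("seen thirty times", successes=30, failures=0)
decisions = {rule.name: should_reuse(rule) for rule in (thin, solid)}
```

Under these assumptions, a rule confirmed three times is held back while a rule confirmed thirty times is reused, even though both have never been contradicted. That is the sense in which reuse scales with accumulated evidence.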