LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Real-world AI agents face dynamic, context-dependent safety risks, yet guardrails struggle to adapt after deployment because the available feedback is limited to sparse, noisy, user-reported failures. This work proposes LiSA, a framework that improves a fixed base guardrail through structured memory: it converts occasional failures into reusable policy abstractions and adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts. Evidence-aware confidence gating, based on a posterior lower bound, makes memory reuse scale with accumulated evidence rather than with empirical accuracy alone. On the PrivacyLens+, ConFaide+, and AgentHarm benchmarks, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust even at a 20% label-flip rate, and pushes the latency-performance frontier beyond what scaling the backbone model achieves.
📝 Abstract
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
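The idea of gating memory reuse on a posterior lower bound, rather than on empirical accuracy alone, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the posterior, so the code below assumes a Beta-Bernoulli model with a uniform prior and estimates the lower quantile by seeded Monte Carlo sampling. The function names, the `delta` level, and the `threshold` value are all hypothetical.

```python
import random


def posterior_lower_bound(successes, failures, delta=0.05,
                          prior_a=1.0, prior_b=1.0,
                          n_samples=20000, seed=0):
    """Approximate the delta-quantile of a Beta posterior over a rule's accuracy.

    Assumption (not from the paper): each time a stored rule is applied, the
    outcome is a Bernoulli trial, and the prior over its accuracy is
    Beta(prior_a, prior_b).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    a = prior_a + successes
    b = prior_b + failures
    samples = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    return samples[int(delta * n_samples)]


def should_reuse(successes, failures, threshold=0.8):
    """Reuse a stored rule only if its accuracy lower bound clears the threshold."""
    return posterior_lower_bound(successes, failures) >= threshold


if __name__ == "__main__":
    # Two rules with identical empirical accuracy (100%) but different evidence.
    for s, f in [(1, 0), (50, 0), (8, 2)]:
        lb = posterior_lower_bound(s, f)
        print(f"successes={s:2d} failures={f} "
              f"lower_bound={lb:.3f} reuse={should_reuse(s, f)}")
```

Under these assumptions, a rule confirmed once and a rule confirmed fifty times both have perfect empirical accuracy, but only the latter has a lower bound high enough to pass the gate, which is the sense in which reuse scales with accumulated evidence.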
Problem

Research questions and friction points this paper is trying to address.

guardrail adaptation
contextual safety
lifelong learning
sparse feedback
AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifelong Safety Adaptation
Conservative Policy Induction
Sparse Feedback Learning
Conflict-Aware Rules
Evidence-Aware Confidence Gating