🤖 AI Summary
This work addresses the risk that autonomous AI agents can become unsafe through behavioral drift, adversarial adaptation, or shifts in decision-making patterns, even without any code change and while remaining fully authorized. To tackle this challenge, the paper introduces the Agent Viability Framework, built on the Informational Viability Principle and grounded in Aubin's viability theory, which moves AI governance from reactive to predictive safety assurance. The framework estimates an upper bound on unobserved risk using dedicated statistical estimators: KL divergence, segment-vs-rest z-tests, and sequential pattern matching. Runtime interventions are enforced through a fail-secure monotonic pipeline and a closed-loop Autopilot, formalized as an instance of Aubin's regulation map, that uses a kill switch only as a last resort. The study defines a Viability Index and a first-order prediction of the first-passage time. It contributes a theoretical framework, a reference implementation (RiskGate), and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is left to follow-up work.
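The three estimator families named in this summary can be illustrated with a minimal Python sketch. It makes three assumptions that the source does not specify. First, actions are discrete labels. Second, the segment-vs-rest z-test is treated as a standard two-proportion z-test. Third, sequential pattern matching is read as detecting an ordered, not necessarily contiguous, subsequence of actions. The function names (`kl_divergence`, `segment_vs_rest_z`, `contains_pattern`) and the smoothing parameter `alpha` are hypothetical and are not RiskGate's API.

```python
import math
from collections import Counter


def kl_divergence(baseline, recent, alpha=1.0):
    """KL(recent || baseline) over discrete action labels.

    Hypothetical drift estimator: additive (Laplace) smoothing with `alpha`
    keeps every probability nonzero, so the log ratio is always defined.
    """
    support = set(baseline) | set(recent)
    b_counts, r_counts = Counter(baseline), Counter(recent)
    k = len(support)
    b_total = len(baseline) + alpha * k
    r_total = len(recent) + alpha * k
    kl = 0.0
    for action in support:
        p = (r_counts[action] + alpha) / r_total  # recent distribution
        q = (b_counts[action] + alpha) / b_total  # baseline distribution
        kl += p * math.log(p / q)
    return kl


def segment_vs_rest_z(seg_hits, seg_n, rest_hits, rest_n):
    """Two-proportion z statistic (assumed reading of 'segment-vs-rest').

    Compares the rate of flagged events in one segment against the rate
    in all remaining observations, using the pooled standard error.
    """
    p_seg = seg_hits / seg_n
    p_rest = rest_hits / rest_n
    pooled = (seg_hits + rest_hits) / (seg_n + rest_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / seg_n + 1 / rest_n))
    return 0.0 if se == 0 else (p_seg - p_rest) / se


def contains_pattern(actions, pattern):
    """True if `pattern` occurs in order (not necessarily contiguously).

    `step in it` consumes the shared iterator, so each match must come
    after the previous one.
    """
    it = iter(actions)
    return all(step in it for step in pattern)


if __name__ == "__main__":
    baseline = ["read"] * 90 + ["write"] * 10
    recent = ["read"] * 50 + ["write"] * 50
    print(kl_divergence(baseline, recent))
    print(segment_vs_rest_z(30, 100, 10, 100))
    print(contains_pattern(["read", "escalate", "delete"], ["escalate", "delete"]))
```

Under these assumptions each estimator returns a scalar that could feed one term of the risk bound. How the source maps estimator outputs onto $U(x)$, $SB(x)$, and $RG(x)$ is not stated here.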
📝 Abstract
Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the \textbf{Informational Viability Principle}: governing an agent reduces to estimating a bound on unobserved risk $\hat{B}(x) = U(x) + SB(x) + RG(x)$ and allowing an action only when its capacity $S(x)$ exceeds $\hat{B}(x)$ by a safety margin. The \textbf{Agent Viability Framework}, grounded in Aubin's viability theory, establishes three properties as individually necessary and collectively sufficient for the documented failure modes: monitoring (P1), anticipation (P2), and monotonic restriction (P3). \textbf{RiskGate} instantiates the framework with dedicated statistical estimators (KL divergence, segment-vs-rest $z$-tests, sequential pattern matching), a fail-secure monotonic pipeline, and a closed-loop Autopilot formalised as an instance of Aubin's regulation map, with a kill switch as a last resort. A scalar Viability Index $VI(t) \in [-1,+1]$, combined with a first-order prediction of the first-passage time $t^*$, transforms governance from reactive to predictive. Contributions are the theoretical framework, the reference implementation, and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is scoped as follow-up work.
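The admission rule, the monotonic pipeline, and the $t^*$ prediction can be sketched in a few lines of Python. The bound $\hat{B}(x) = U(x) + SB(x) + RG(x)$ and the "capacity exceeds bound by a margin" test come from the abstract. Several other details are assumptions, not RiskGate's design:

- the three-level `Decision` ordering;
- the normalisation $VI = (S - \hat{B}) / (|S| + |\hat{B}|)$, chosen only so that $VI \in [-1,+1]$;
- reading "first-order $t^*$ prediction" as linear extrapolation of $VI$ to zero.

All names are hypothetical.

```python
from enum import IntEnum
from typing import Callable, Optional, Sequence


class Decision(IntEnum):
    # Hypothetical levels, ordered by restrictiveness (larger = stricter).
    ALLOW = 0
    RESTRICT = 1
    BLOCK = 2


def risk_bound(u: float, sb: float, rg: float) -> float:
    """B_hat(x) = U(x) + SB(x) + RG(x), as stated in the abstract."""
    return u + sb + rg


def gate(capacity: float, bound: float, margin: float) -> Decision:
    """Allow only when S(x) exceeds B_hat(x) by the safety margin."""
    return Decision.ALLOW if capacity > bound + margin else Decision.BLOCK


Stage = Callable[[dict], Decision]


def run_pipeline(stages: Sequence[Stage], ctx: dict) -> Decision:
    """Monotonic, fail-secure pipeline (hypothetical sketch).

    Each stage may only keep or tighten the running decision; a stage
    that raises is treated as BLOCK instead of being skipped.
    """
    decision = Decision.ALLOW
    for stage in stages:
        try:
            verdict = stage(ctx)
        except Exception:
            return Decision.BLOCK  # fail secure
        decision = max(decision, verdict)  # never loosen a restriction
        if decision == Decision.BLOCK:
            break
    return decision


def viability_index(capacity: float, bound: float) -> float:
    """Assumed normalisation of the margin S - B_hat into [-1, +1]."""
    total = abs(capacity) + abs(bound)
    if total == 0:
        return 0.0
    return max(-1.0, min(1.0, (capacity - bound) / total))


def predict_first_passage(t_prev: float, vi_prev: float,
                          t_now: float, vi_now: float) -> Optional[float]:
    """First-order (linear) estimate of when VI reaches 0.

    Returns None when VI is not trending toward the boundary.
    """
    if t_now <= t_prev:
        raise ValueError("timestamps must increase")
    if vi_now <= 0:
        return t_now  # already at or past the viability boundary
    slope = (vi_now - vi_prev) / (t_now - t_prev)
    if slope >= 0:
        return None
    return t_now - vi_now / slope


if __name__ == "__main__":
    b = risk_bound(0.1, 0.2, 0.3)
    print(gate(1.0, b, 0.2), viability_index(1.0, b))
    print(run_pipeline([lambda c: Decision.RESTRICT, lambda c: Decision.ALLOW], {}))
    print(predict_first_passage(0.0, 0.6, 10.0, 0.4))
```

The pipeline is monotonic because it takes the stage-wise maximum: a later stage can add a restriction but can never lift one. It is fail-secure because an exception yields a block instead of silently passing. In this sketch, VI falling from 0.6 to 0.4 over 10 time units extrapolates to a boundary crossing at roughly $t^* = 30$.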