🤖 AI Summary
Existing AI safety methods primarily target predefined domain-specific constraints, suffering from limited generalizability and insufficient probabilistic controllability. This paper introduces the first domain-agnostic, scalable safety assurance framework supporting arbitrary user-specified constraints and exact target satisfaction probabilities. Our method integrates differentiable approximations of probabilistic constraints with gradient-based optimization, grounded in a theoretically rigorous verification framework based on internal test data and conservative statistical hypothesis testing. Key contributions include: (1) a universal safety mechanism compatible with arbitrary constraints and probability thresholds; (2) the first formal proof of probabilistic constraint satisfaction and quantifiable scaling laws linking safety guarantees to test set size; and (3) an end-to-end trainable formulation via differentiable loss design and gradient computation. Empirical evaluation across demand forecasting, SafetyGym, and dialogue safety tasks demonstrates superior performance, especially under stringent safety thresholds, where our approach outperforms baselines by several orders of magnitude, while safety guarantees scale robustly with the volume of internal test data.
📝 Abstract
Ensuring the safety of AI systems has recently emerged as a critical priority for real-world deployment, particularly in physical AI applications. Current approaches to AI safety typically address predefined domain-specific safety conditions, limiting their ability to generalize across contexts. We propose a novel AI safety framework that ensures AI systems comply with **any user-defined constraint**, with **any desired probability**, and across **various domains**. In this framework, we combine an AI component (e.g., a neural network) with an optimization problem to produce responses that minimize objectives while satisfying user-defined constraints with probabilities exceeding user-defined thresholds. To assess the credibility of the AI component, we propose *internal test data*, a supplementary set of safety-labeled data, and a *conservative testing* methodology that provides statistical validity for using internal test data. We also present an approximation method for the loss function and show how to compute its gradient for training. We mathematically prove that probabilistic constraint satisfaction is guaranteed under specific, mild conditions and prove a scaling law between safety and the amount of internal test data. We demonstrate our framework's effectiveness through experiments in diverse domains: demand prediction for production decisions, safe reinforcement learning within the SafetyGym simulator, and guarding AI chatbot outputs. Through these experiments, we show that our method guarantees safety for user-specified constraints, outperforms existing methods by **up to several orders of magnitude** in low safety threshold regions, and scales effectively with the size of the internal test data.
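To make the conservative-testing idea concrete, here is a minimal sketch of certifying a probabilistic safety constraint from safety-labeled internal test data. The paper's exact statistical test is not specified in this abstract; the function name and the use of a one-sided Hoeffding lower confidence bound here are illustrative assumptions, chosen because such a bound is conservative and exhibits the same qualitative behavior as the stated scaling law (guarantees tighten as the internal test set grows).

```python
import math

def conservative_safety_check(n_safe, n_total, threshold, alpha=0.05):
    """Illustrative conservative test: certify that the true constraint
    satisfaction probability exceeds `threshold`, at confidence 1 - alpha,
    using a one-sided Hoeffding lower confidence bound on the empirical
    safety rate measured on internal test data.

    NOTE: a hypothetical sketch, not the paper's actual test statistic.
    """
    p_hat = n_safe / n_total                                  # empirical safety rate
    slack = math.sqrt(math.log(1.0 / alpha) / (2.0 * n_total))  # shrinks as O(1/sqrt(n))
    return p_hat - slack >= threshold

# Example: 990 of 1000 internal test points satisfied the user constraint.
print(conservative_safety_check(990, 1000, threshold=0.95))  # True
print(conservative_safety_check(990, 1000, threshold=0.96))  # False
```

Because the slack term decays as O(1/sqrt(n)), more internal test data lets the same empirical safety rate certify stricter thresholds, mirroring the scaling behavior the abstract describes.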