🤖 AI Summary
Current safety mechanisms for large language models focus mainly on individual outputs, which makes them inadequate for sensitive domains such as education and mental health, where risks accumulate across an interaction and depend on context. This work introduces the Grounded Observer framework, which draws on robotics to reformulate safety guardrails as a runtime behavioral control problem over interaction trajectories. It introduces formal constructs for enforcing constraints in uncertain, closed-loop systems and supports context-aware runtime interventions. Applied in three real-world deployments (small talk, in-home autism therapy, and behavioral de-escalation in schools), the framework mitigates drift into undesirable interaction regimes while adapting to diverse social contexts.
📝 Abstract
Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.
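To make the distinction between per-output and trajectory-level safety concrete, the following is a minimal illustrative sketch, not the Grounded Observer implementation. It assumes a hypothetical upstream scorer that assigns each turn a risk value in [0, 1]; the class name `TrajectoryMonitor`, the thresholds, and the window size are all invented for illustration. The point it shows is that a sequence of turns can each pass a per-output check while the trajectory as a whole crosses a cumulative limit and triggers a runtime intervention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryMonitor:
    """Hypothetical runtime monitor over an interaction trajectory.

    All parameters are illustrative assumptions, not values from the paper.
    """
    per_turn_limit: float = 0.8   # what a per-output filter would check
    budget: float = 1.5           # cumulative risk allowed within the window
    window: int = 4               # number of recent turns considered
    history: List[float] = field(default_factory=list)

    def step(self, risk: float) -> str:
        """Record one turn's risk score and return a control decision."""
        self.history.append(risk)
        recent = self.history[-self.window:]
        if risk > self.per_turn_limit:
            return "block"        # a single output is unsafe on its own
        if sum(recent) > self.budget:
            return "intervene"    # no single turn is unsafe, but the drift is
        return "allow"


# Risk scores from a hypothetical upstream classifier, one per turn.
risks = [0.3, 0.4, 0.5, 0.5]
monitor = TrajectoryMonitor()
decisions = [monitor.step(r) for r in risks]
print(decisions)
```

In this toy run every turn stays below the per-turn limit, so an output-level filter alone would never fire. The intervention comes only from the state the monitor carries across turns, which is the closed-loop aspect the abstract emphasizes.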