🤖 AI Summary
This work addresses reward-safety tradeoffs under out-of-distribution (OOD) deployment shifts, where pretraining-only safe in-context reinforcement learning (ICRL) can fall short because the remaining safety budget influences behavior only through frozen policy conditioning. The authors propose a latent Q-Barrier shield that requires no test-time parameter updates: before deployment it learns a context representation, latent dynamics, and an ensemble cost critic. At deployment it infers context from interaction history and filters or softly reweights candidate actions with an explicit action-level check of the remaining budget against predicted future cost. The paper proves a conditional, error-decomposed barrier-margin result that holds up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, after a short context window, the shield achieves higher return than a strong safe-ICRL baseline in four benchmarks while matching or lowering average episode cost in all five.
📝 Abstract
Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pretraining-only safe ICRL can give poor reward-safety tradeoffs because the remaining budget affects behavior only through frozen policy conditioning, not an explicit action-level check against predicted future cost. We propose a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic before deployment. Without parameter updates, the shield infers context from history and filters or softly reweights candidate actions using the remaining budget and predicted future cost. We prove a conditional, error-decomposed barrier-margin result: a Q-Barrier-satisfying action leaves the next latent-budget state with an approximately budget-safe continuation under the learned critic, up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, the shield improves deployment-time reward-safety tradeoffs over a strong safe-ICRL baseline: after a short context window, it achieves higher return in four of five benchmarks while matching or lowering average episode cost in all five.
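The action-level check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a discrete set of candidate actions, an ensemble of predicted future costs per action that is already available, and a pessimistic aggregate of mean plus `kappa` times standard deviation. The names `q_barrier_weights`, `shield_policy`, `kappa`, and `temperature` are hypothetical, and context inference and latent dynamics are omitted.

```python
import numpy as np


def q_barrier_weights(cost_q_ensemble, remaining_budget, kappa=1.0, temperature=None):
    """Weight candidate actions by a budget-versus-predicted-cost margin.

    cost_q_ensemble: array of shape (n_members, n_actions) holding each
        ensemble member's predicted future cost for each candidate action.
    remaining_budget: safety budget left in the current episode.
    kappa: pessimism coefficient (an assumption of this sketch).
    temperature: None for a hard filter, a positive float for soft reweighting.
    """
    q = np.asarray(cost_q_ensemble, dtype=float)
    # Pessimistic cost estimate: ensemble mean plus kappa times ensemble std.
    pessimistic = q.mean(axis=0) + kappa * q.std(axis=0)
    # A non-negative margin means the action passes the barrier check.
    margin = remaining_budget - pessimistic

    if temperature is None:
        mask = margin >= 0
        if not mask.any():
            # Assumed fallback: keep only the action with the lowest predicted cost.
            mask = np.zeros_like(mask)
            mask[np.argmin(pessimistic)] = True
        return mask.astype(float)

    # Soft reweighting: sigmoid of the scaled margin, clipped for numerical stability.
    scaled = np.clip(margin / temperature, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-scaled))


def shield_policy(policy_probs, weights):
    """Reweight the frozen policy's action probabilities and renormalize."""
    shielded = np.asarray(policy_probs, dtype=float) * weights
    total = shielded.sum()
    if total <= 0:
        # The policy puts no mass on any permitted action: fall back to the weights.
        return weights / weights.sum()
    return shielded / total


# Two ensemble members, three candidate actions.
ensemble = np.array([[1.0, 5.0, 9.0],
                     [1.0, 5.0, 11.0]])
frozen_probs = np.array([0.2, 0.3, 0.5])

hard = q_barrier_weights(ensemble, remaining_budget=6.0)
soft = q_barrier_weights(ensemble, remaining_budget=6.0, temperature=1.0)
print(shield_policy(frozen_probs, hard))
print(shield_policy(frozen_probs, soft))
```

With a remaining budget of 6, the hard filter removes the third action, whose pessimistic cost estimate exceeds the budget, and renormalizes the frozen policy over the other two. The soft variant keeps every action but shrinks the weight of actions whose margin is small or negative. No parameters are updated in either case.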