🤖 AI Summary
This work addresses robust out-of-distribution generalization of reinforcement-learning (RL) policies by learning representations with **control sufficiency**, not merely **observation sufficiency**. To this end, we formulate contextual RL as a **decoupled inference–control problem**, theoretically characterize the hierarchical relationship between observation and control sufficiency, and design an ELBO-style objective based on the variational information bottleneck that explicitly separates representation learning from policy optimization. Our method pairs a variational encoder with an off-policy policy learner. On continuous-control benchmarks with shifted physical parameters, it achieves markedly better sample efficiency and more robust policies under out-of-distribution dynamics than baselines. In doing so, it unifies theoretical analysis and practical implementation for contextual RL.
📝 Abstract
Capturing latent variations ("contexts") is key to deploying reinforcement-learning (RL) agents beyond their training regime. We recast context-based RL as a dual inference–control problem and formally characterize two properties and their hierarchy: observation sufficiency (preserving all predictive information) and control sufficiency (retaining decision-relevant information). Exploiting this dichotomy, we derive a contextual evidence lower bound (ELBO)-style objective that cleanly separates representation learning from policy learning, and optimize it with Bottlenecked Contextual Policy Optimization (BCPO), an algorithm that places a variational information-bottleneck encoder in front of any off-policy policy learner. On standard continuous-control benchmarks with shifting physical parameters, BCPO matches or surpasses baselines while using fewer samples and retains performance far outside the training regime. The framework unifies theory, diagnostics, and practice for context-based RL.
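To make the information-bottleneck idea concrete, the sketch below shows the generic form of such an ELBO-style objective: a prediction (likelihood) term minus a β-weighted KL "rate" term that compresses the context posterior q(z | context) toward a standard-normal prior. This is a minimal illustration under assumed diagonal-Gaussian posteriors; the function names, the β value, and the stand-in likelihood are hypothetical and not taken from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    # This is the closed-form "rate" term of a variational information bottleneck.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def ib_objective(log_likelihood, mu, sigma, beta=1e-2):
    # ELBO-style bound: prediction term minus beta-weighted compression term.
    # log_likelihood stands in for whatever predictive/control term the
    # representation is trained to support (illustrative, not the paper's exact loss).
    return log_likelihood - beta * kl_to_standard_normal(mu, sigma)

# Toy posterior for a 2-dimensional context latent (hypothetical numbers).
mu = np.array([0.1, -0.2])
sigma = np.array([0.9, 1.1])
print(ib_objective(log_likelihood=-1.5, mu=mu, sigma=sigma))
```

A posterior matching the prior exactly (zero mean, unit variance) incurs zero KL cost, while any informative posterior pays a rate penalty scaled by β, which is the knob trading off compression against predictive fidelity.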