Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

📅 2026-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance degradation and safety risks that reinforcement learning agents face during sim-to-real (Sim2Real) transfer because of discrepancies between the simulated and real environments. The authors propose a meta-reinforcement learning framework that integrates probabilistic latent embeddings with dynamic risk modulation. The approach infers the characteristics of the real-world environment through latent context variables, and it dynamically adjusts the policy's risk sensitivity by combining distributional reinforcement learning with constrained Markov decision processes (CMDPs). The agent can therefore behave conservatively during initial deployment, when its estimate of the environment is least reliable, and adapt more efficiently to the target real-world environment as that estimate improves.
📝 Abstract
Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when they encounter unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging a latent context variable from meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows the risk level of the deployed policy to be adjusted dynamically based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.
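The idea of tying the risk level to the accuracy of the latent context estimate can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the distributional critic outputs equally weighted quantiles of the cumulative cost, that risk is measured with conditional value-at-risk (CVaR), and that estimation accuracy is proxied by the standard deviation of a Gaussian posterior over the latent context. The function names and the parameters `std_max`, `alpha_min`, and `cost_budget` are hypothetical.

```python
import numpy as np


def risk_level(posterior_std, std_max=1.0, alpha_min=0.1):
    """Map latent-context uncertainty to a CVaR level alpha in [alpha_min, 1].

    Assumption: high posterior std means a poor context estimate, so alpha
    shrinks and the policy looks only at the worst cost outcomes.
    """
    u = float(np.clip(np.mean(posterior_std) / std_max, 0.0, 1.0))
    return 1.0 - u * (1.0 - alpha_min)


def cvar_upper(cost_quantiles, alpha):
    """Mean of the highest-cost alpha-fraction of equally weighted quantiles."""
    q = np.sort(np.asarray(cost_quantiles, dtype=float))
    k = max(1, int(np.ceil(alpha * q.size)))
    return float(q[-k:].mean())


def is_action_safe(cost_quantiles, posterior_std, cost_budget):
    """Check the CMDP-style cost constraint under the current risk level."""
    alpha = risk_level(posterior_std)
    return cvar_upper(cost_quantiles, alpha) <= cost_budget


# Hypothetical quantiles of cumulative cost for one candidate action.
quantiles = np.arange(10.0)  # 0, 1, ..., 9

# Early deployment: uncertain context, so only the worst tail is considered.
early = is_action_safe(quantiles, posterior_std=[1.0, 1.0], cost_budget=5.0)

# After adaptation: confident context, so the constraint uses the mean cost.
late = is_action_safe(quantiles, posterior_std=[0.0, 0.0], cost_budget=5.0)
```

With these illustrative numbers the same action is rejected early in deployment and accepted once the context estimate is confident, which mirrors the abstract's description of being cautious first and efficient later.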
Problem

Research questions and friction points this paper is trying to address.

Sim-to-Real
reinforcement learning
safety violation
performance degradation
domain gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

probabilistic latent embeddings
dynamic policy adaptation
Sim-to-Real transfer
distributional reinforcement learning
constrained MDPs