🤖 AI Summary
To address insufficient policy generalization in sim-to-real transfer due to dynamical discrepancies, this paper proposes a context-aware reinforcement learning framework. The method enables adaptive control in unknown real-world environments by online estimating physical dynamics parameters—such as friction, mass, and inertia—and feeding them as conditional inputs to the policy network. It integrates domain randomization, state inference, and conditional policy networks, incorporating a learnable dynamic context encoding module during training. Evaluated on standard control benchmarks (CartPole, Reacher) and a real robotic pushing task, the approach significantly outperforms context-agnostic baselines, achieving an average 32.7% improvement in task success rate under unseen dynamical configurations, while maintaining real-time inference capability. The core contribution lies in explicitly modeling implicit dynamics via a lightweight, online context estimation mechanism—and empirically demonstrating its critical role in enhancing cross-domain robustness.
📝 Abstract
Sim-to-real transfer remains a major challenge in reinforcement learning (RL) for robotics, as policies trained in simulation often fail to generalize to the real world due to discrepancies in environment dynamics. Domain Randomization (DR) mitigates this issue by exposing the policy to a wide range of randomized dynamics during training, yet leading to a reduction in performance. While standard approaches typically train policies agnostic to these variations, we investigate whether sim-to-real transfer can be improved by conditioning the policy on an estimate of the dynamics parameters -- referred to as context. To this end, we integrate a context estimation module into a DR-based RL framework and systematically compare SOTA supervision strategies. We evaluate the resulting context-aware policies in both a canonical control benchmark and a real-world pushing task using a Franka Emika Panda robot. Results show that context-aware policies outperform the context-agnostic baseline across all settings, although the best supervision strategy depends on the task.