Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

๐Ÿ“… 2025-12-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses the dual robustness deficiency in cross-domain offline reinforcement learningโ€”namely, distributional shift during training (out-of-distribution transfer) and dynamic environmental perturbations during testing. We propose the first framework that jointly optimizes robustness in both phases. Our core contribution is a robust cross-domain Bellman operator, integrating dynamic value penalization with Huber loss to simultaneously achieve: (i) conservative estimation against source-domain dynamics variation during training, and (ii) adaptive robustness to unseen environmental disturbances at test time. The method requires no online interaction or domain labels, making it suitable for real-world deployment with limited data coverage. Extensive experiments on diverse dynamic shift benchmarks demonstrate that our approach significantly outperforms mainstream offline RL baselines, improving policy stability by up to 23.6%, thereby validating the effectiveness and generalizability of dual robustness modeling.

Technology Category

Application Category

๐Ÿ“ Abstract
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical extbf{D}ual- extbf{RO}bust extbf{C}ross-domain extbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
Problem

Research questions and friction points this paper is trying to address.

Enhances test-time robustness against dynamics perturbations in cross-domain offline RL.
Addresses fragility of policies under dynamics shifts with limited target domain data.
Introduces a robust Bellman operator to prevent value overestimation and underestimation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces robust cross-domain Bellman operator for test-time robustness
Uses dynamic value penalty to counteract value overestimation
Incorporates Huber loss to address potential value underestimation
๐Ÿ”Ž Similar Papers
No similar papers found.