Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering

📅 2025-12-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Cross-domain offline reinforcement learning faces two challenges: a dynamics mismatch between the source and target domains, and uneven sample utility within the source domain. Existing approaches focus solely on dynamics alignment and neglect the selection of high-value samples. This paper first establishes that dynamics alignment and value alignment are both necessary, develops a theoretical analysis framework, and proposes DVDF, a Dynamics- and Value-aligned Data Filtering method. DVDF quantifies inter-domain dynamics shift via a learned dynamics model and jointly evaluates sample quality using value function estimates, so that source samples are filtered under both criteria. Experiments show that DVDF outperforms prior strong baselines in settings with kinematic and morphology shifts between domains and with extremely limited target-domain data (as few as 5,000 transitions).

📝 Abstract

Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domains, simply merging the data from the two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook value alignment, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our Dynamics- and Value-aligned Data Filtering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.

Problem

Research questions and friction points this paper is trying to address.

Addresses cross-domain offline reinforcement learning challenges

Proposes filtering source data for dynamics and value alignment

Improves policy performance with limited target domain data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selectively shares source domain samples with high dynamics alignment

Incorporates value alignment for selecting high-quality source data

Proposes DVDF method for cross-domain offline policy adaptation
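
The idea of keeping only source samples that satisfy both criteria can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `dual_criterion_mask`, the per-sample scores `dynamics_gap` and `value_score`, the shared quantile level `keep_frac`, and the use of a simple quantile threshold on each criterion are all assumptions made here for illustration. How DVDF actually computes and combines the two scores is not described on this page.

```python
import numpy as np


def dual_criterion_mask(dynamics_gap, value_score, keep_frac=0.5):
    """Boolean mask over source transitions that pass both criteria.

    All inputs are hypothetical, illustrative quantities:
      dynamics_gap: per-sample score, lower = closer to target dynamics.
      value_score:  per-sample score, higher = more valuable sample.
      keep_frac:    quantile level used to set both thresholds.
    """
    gap = np.asarray(dynamics_gap, dtype=float)
    val = np.asarray(value_score, dtype=float)
    if gap.shape != val.shape:
        raise ValueError("dynamics_gap and value_score must have the same shape")
    if not 0.0 < keep_frac <= 1.0:
        raise ValueError("keep_frac must be in (0, 1]")

    # Dynamics criterion: keep samples whose gap is at or below the quantile.
    gap_threshold = np.quantile(gap, keep_frac)
    # Value criterion: keep samples whose value is at or above the mirrored quantile.
    val_threshold = np.quantile(val, 1.0 - keep_frac)

    # A sample is shared only if it passes both tests (assumed AND rule).
    return (gap <= gap_threshold) & (val >= val_threshold)


# Toy scores for four source transitions (made-up numbers).
gap = [0.1, 0.9, 0.2, 0.8]
val = [5.0, 6.0, 1.0, 0.5]
mask = dual_criterion_mask(gap, val, keep_frac=0.5)
```

In this toy example only the first transition passes both tests: the second has high value but a large dynamics gap, and the third is dynamics-aligned but low-value. That is the case a dynamics-only filter would handle differently.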
