Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses value misassignment in heterogeneous cross-domain offline reinforcement learning, where source data collected from multiple source domains by diverse behavior policies can yield misleading value estimates and steer data filtering toward suboptimal samples. The paper proposes V2A, a method that integrates dynamics alignment, value alignment, and value assignment in a single framework. V2A uses temporally consistent modality representation learning to extract dynamics modalities from the source dataset, then applies modality-aware advantage learning to rectify value alignment. A data filtering step then selectively shares source data for policy learning. According to the authors, V2A significantly outperforms strong baselines in general heterogeneous cross-domain offline RL settings.

📝 Abstract

Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general heterogeneous cross-domain offline RL setting, where the source datasets may be collected from multiple source domains by diverse behavior policies. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent's performance. To address this issue, we propose V2A, which integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that V2A significantly outperforms strong baseline methods under general heterogeneous cross-domain offline RL settings.
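The filtering problem described in the abstract can be illustrated with a toy example. The sketch below is not V2A's algorithm, since the abstract gives no equations. It only contrasts a filter that scores every source transition against one pooled baseline with a filter that scores each transition against the baseline of its own group. The `group` label is a hypothetical stand-in for a dynamics modality, `value` is a hypothetical stand-in for a learned value estimate, and the "keep if advantage is non-negative" rule is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical source transitions: two groups whose values live on different scales.
SOURCE = [
    {"id": 0, "group": "A", "value": 10.0},
    {"id": 1, "group": "A", "value": 12.0},
    {"id": 2, "group": "A", "value": 14.0},
    {"id": 3, "group": "B", "value": 1.0},
    {"id": 4, "group": "B", "value": 2.0},
    {"id": 5, "group": "B", "value": 3.0},
]


def pooled_filter(transitions):
    """Keep transitions whose value is at least the mean over ALL transitions."""
    baseline = mean([t["value"] for t in transitions])
    return [t["id"] for t in transitions if t["value"] - baseline >= 0]


def group_aware_filter(transitions):
    """Keep transitions whose value is at least the mean of their OWN group."""
    values_by_group = defaultdict(list)
    for t in transitions:
        values_by_group[t["group"]].append(t["value"])
    baselines = {g: mean(vs) for g, vs in values_by_group.items()}
    return [
        t["id"]
        for t in transitions
        if t["value"] - baselines[t["group"]] >= 0
    ]


if __name__ == "__main__":
    print(pooled_filter(SOURCE))       # [0, 1, 2]
    print(group_aware_filter(SOURCE))  # [1, 2, 4, 5]
```

With the pooled baseline, every transition from group A is kept and every transition from group B is dropped, including the best ones in B. The group-aware baseline keeps the above-average transitions from both groups. This is one simple way a shared value scale across heterogeneous sources could mislead selection.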

Problem

Research questions and friction points this paper is trying to address.

cross-domain offline reinforcement learning

heterogeneous datasets

value misassignment

value alignment

dynamics shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

value misassignment

cross-domain offline RL

heterogeneous datasets