Offline Reinforcement Learning with Domain-Unlabeled Data

📅 2024-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Offline reinforcement learning (RL) faces challenges in data-expensive domains (e.g., robotics, healthcare), where multi-domain trajectory data are mixed and labeled target-domain samples are scarce. To address this, we propose Positive-Unlabeled Offline RL (PUORL), a novel paradigm that leverages only a small number of labeled target-domain trajectories alongside abundant unlabeled multi-domain trajectories. By applying positive-unlabeled (PU) learning to train a domain classifier, PUORL accurately identifies and filters target-domain data, thereby augmenting the limited labeled samples. The method is plug-and-play—compatible with mainstream offline RL algorithms (e.g., CQL, BCQ) without modifying their policy optimization procedures. On D4RL’s multi-domain variants, PUORL achieves high-precision target-domain identification using only 1–3% domain-labeled data and improves policy performance by over 20% under significant dynamics shift compared to baselines. This work marks the first systematic integration of PU learning into cross-domain adaptation for offline RL.

Technology Category

Application Category

📝 Abstract
Offline reinforcement learning (RL) is vital in areas where active data collection is expensive or infeasible, such as robotics or healthcare. In the real world, offline datasets often involve multiple domains that share the same state and action spaces but have distinct dynamics, and only a small fraction of samples are clearly labeled as belonging to the target domain we are interested in. For example, in robotics, precise system identification may only have been performed for part of the deployments. To address this challenge, we consider Positive-Unlabeled Offline RL (PUORL), a novel offline RL setting in which we have a small amount of labeled target-domain data and a large amount of domain-unlabeled data from multiple domains, including the target domain. For PUORL, we propose a plug-and-play approach that leverages positive-unlabeled (PU) learning to train a domain classifier. The classifier then extracts target-domain samples from the domain-unlabeled data, augmenting the scarce target-domain data. Empirical results on a modified version of the D4RL benchmark demonstrate the effectiveness of our method: even when only 1 to 3 percent of the dataset is domain-labeled, our approach accurately identifies target-domain samples and achieves high performance, even under substantial dynamics shift. Our plug-and-play algorithm seamlessly integrates PU learning with existing offline RL pipelines, enabling effective multi-domain data utilization in scenarios where comprehensive domain labeling is prohibitive.
Problem

Research questions and friction points this paper is trying to address.

Addresses offline RL with domain-unlabeled data.
Proposes PUORL for multi-domain data utilization.
Enhances target-domain sample identification accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive-Unlabeled Offline RL for domain-unlabeled data
Plug-and-play approach with PU learning for domain classification
Augments target-domain data using extracted samples from unlabeled data