🤖 AI Summary
Offline reinforcement learning (RL) suffers from the scarcity and high cost of labeled data—particularly human-provided reward annotations—while abundant unlabeled trajectory data remains underutilized. To address this, we propose a novel offline RL framework that effectively incorporates unlabeled trajectories by introducing kernel function approximation into the offline RL paradigm. Specifically, we model both policies and value functions in a reproducing kernel Hilbert space (RKHS), and establish theoretical guarantees grounded in the eigenvalue decay of the RKHS kernel operator. Our method significantly improves policy performance under stringent labeling budgets and provides a provable upper bound on sample complexity. To the best of our knowledge, this is the first approach that simultaneously achieves rigorous theoretical foundations—via nonparametric statistical analysis in RKHS—and practical efficacy in leveraging unlabeled data for offline RL.
📝 Abstract
Offline reinforcement learning (RL) learns policies from a fixed dataset, but often requires large amounts of data. The challenge arises when labeled data is expensive, especially when rewards must be provided by human labelers for large datasets. In contrast, unlabeled data tends to be far cheaper to collect. This situation highlights the importance of finding effective ways to use unlabeled data in offline RL, especially when labeled data is limited or expensive to obtain. In this paper, we present an algorithm that utilizes unlabeled data in offline RL with kernel function approximation, and we provide theoretical guarantees for it. We study various eigenvalue decay conditions on the reproducing kernel Hilbert space $\mathcal{H}_k$, which determine the sample complexity of the algorithm. In summary, our work provides a promising approach for exploiting the advantages offered by unlabeled data in offline RL, whilst maintaining theoretical assurances.
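The abstract's central quantity, eigenvalue decay of the RKHS kernel operator, can be made concrete with a small numerical sketch. The following is an illustrative example (not code from the paper): it approximates the kernel operator's eigenvalues by the spectrum of an empirical Gram matrix for a Gaussian (RBF) kernel, whose fast-decaying spectrum is the kind of condition under which such analyses yield favorable sample-complexity bounds. The data distribution, lengthscale, and sample size below are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative sketch (assumptions, not the paper's setup): sample 1-D
# inputs and form the Gram matrix of a Gaussian (RBF) kernel on them.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=(n, 1))

# RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 * ell^2))
ell = 0.5
sq_dists = (x - x.T) ** 2
K = np.exp(-sq_dists / (2.0 * ell**2))

# Eigenvalues of K / n approximate the eigenvalues of the kernel
# integral operator under the sampling distribution.
eigvals = np.linalg.eigvalsh(K / n)[::-1]  # sorted descending

# For a smooth kernel like the RBF the spectrum decays very quickly,
# so a handful of eigenfunctions carry almost all of the mass.
top = eigvals[:10]
print("top 10 eigenvalues:", top)
print("fraction of spectrum in top 10:", top.sum() / eigvals.sum())
```

Faster eigenvalue decay means a smaller effective dimension of $\mathcal{H}_k$, which is what drives the sample-complexity distinctions the abstract alludes to (e.g., polynomial versus exponential decay regimes).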