CORE: Compensable Reward as a Catalyst for Improving Offline RL in Wireless Networks

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Offline reinforcement learning (Offline RL) in wireless networks suffers from overfitting to suboptimal policies due to scarce expert demonstrations and highly noisy offline datasets. Method: This work presents the first systematic exploration of the Offline RL paradigm for wireless domains, proposing a novel framework integrating behavioral embedding clustering with a contrastive-regularized conditional variational autoencoder (CVAE). It identifies implicit expert trajectories via clustering in behavioral embedding space, employs contrastive learning to disentangle expert and non-expert representations, and constructs a compensatory expert-likelihood reward to mitigate supervision scarcity. Contribution/Results: The approach significantly enhances policy stability and generalization. Evaluated on diverse wireless resource scheduling tasks, it outperforms state-of-the-art Offline RL baselines by 15–32%. This work delivers the first reproducible, domain-aligned Offline RL solution for wireless intelligence.

Technology Category

Application Category

📝 Abstract
Real-world wireless data are expensive to collect and often lack sufficient expert demonstrations, causing existing offline RL methods to overfit suboptimal behaviors and exhibit unstable performance. To address this issue, we propose CORE, an offline RL framework specifically designed for wireless environments. CORE identifies latent expert trajectories from noisy datasets via behavior embedding clustering, and trains a conditional variational autoencoder with a contrastive objective to separate expert and non-expert behaviors in latent space. Based on the learned representations, CORE constructs compensable rewards that reflect expert-likelihood, effectively guiding policy learning under limited or imperfect supervision. More broadly, this work represents one of the early systematic explorations of offline RL in wireless networking, where prior adoption remains limited. Beyond introducing offline RL techniques to this domain, we further examine intrinsic wireless data characteristics and develop a domain-aligned algorithm that explicitly accounts for their structural properties. While offline RL has not yet been fully established as a standard methodology in the wireless community, our study aims to provide foundational insights and empirical evidence to support its broader acceptance.
Problem

Research questions and friction points this paper is trying to address.

Improves offline RL performance in wireless networks
Addresses limited expert data and overfitting issues
Develops domain-aligned algorithm for wireless data characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavior embedding clustering identifies expert trajectories
Conditional variational autoencoder separates expert behaviors
Compensable rewards guide policy with limited supervision
🔎 Similar Papers
No similar papers found.