🤖 AI Summary
This paper addresses online decision optimization under queue stability constraints in Internet-of-Things (IoT) systems. Method: The authors propose LDPTRLQ, an algorithm that deeply integrates the Lyapunov drift-plus-penalty framework with reinforcement learning (RL) under realistic assumptions, departing from the conventional "stabilize-then-optimize" paradigm. Rather than merging the two frameworks directly, LDPTRLQ embeds queue stability as a constraint within the RL policy update, guaranteeing Lyapunov stability while maximizing long-term cumulative reward. Contribution/Results: Theoretical analysis establishes convergence and stability guarantees. Evaluations across diverse IoT simulation tasks show that LDPTRLQ outperforms pure Lyapunov-based methods, standalone RL approaches, and state-of-the-art baselines in convergence speed, queue stability, and policy performance.
📝 Abstract
With the proliferation of Internet of Things (IoT) devices, the demand for solving complex online optimization problems has intensified. The Lyapunov drift-plus-penalty algorithm is a widely adopted approach for ensuring queue stability, and prior work has made preliminary attempts to combine it with reinforcement learning (RL). In this paper, we investigate how to adapt the Lyapunov drift-plus-penalty algorithm for RL, and through rigorous theoretical analysis derive an effective way to combine the two under a set of common and reasonable conditions. Unlike existing approaches that directly merge the two frameworks, our proposed algorithm, the Lyapunov drift-plus-penalty method tailored for reinforcement learning with queue stability (LDPTRLQ), offers theoretical advantages by balancing the greedy, per-slot optimization of drift-plus-penalty against the long-term perspective of RL. Simulation results on multiple problems show that LDPTRLQ outperforms baselines built on the Lyapunov drift-plus-penalty method and on RL alone, corroborating our theoretical derivations. The results also show that LDPTRLQ surpasses the other benchmarks in compatibility and stability.
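To make the trade-off the abstract refers to concrete, the standard Lyapunov drift-plus-penalty rule (the classical framework the paper builds on, not the paper's LDPTRLQ itself) greedily picks, each time slot, the action minimizing V·penalty(a) + Σᵢ Qᵢ·(arrivalᵢ(a) − serviceᵢ(a)), where V weights the penalty against queue drift. The sketch below is a minimal illustration under assumed names and data structures; none of the identifiers come from the paper.

```python
import numpy as np

def drift_plus_penalty_action(queues, arrivals, service, penalty, V):
    """Greedy per-slot action selection under the classical Lyapunov
    drift-plus-penalty framework (illustrative, not the paper's LDPTRLQ).

    queues   : current backlog Q_i for each of the N queues, shape (N,)
    arrivals : arrival rate per queue under each action, shape (A, N)
    service  : service rate per queue under each action, shape (A, N)
    penalty  : per-slot penalty (e.g., energy cost) of each action, shape (A,)
    V        : trade-off weight; larger V favors low penalty over low drift
    """
    best_a, best_score = None, np.inf
    for a in range(len(penalty)):
        # Linearized Lyapunov drift term: sum_i Q_i * (arrival_i - service_i)
        drift = np.dot(queues, arrivals[a] - service[a])
        score = V * penalty[a] + drift
        if score < best_score:
            best_a, best_score = a, score
    return best_a

# Usage: action 1 serves the queues but costs penalty 2; action 0 is free
# but serves nothing. Small V drains queues, large V avoids the penalty.
queues = np.array([5.0, 2.0])
arrivals = np.array([[1.0, 1.0], [1.0, 1.0]])
service = np.array([[0.0, 0.0], [3.0, 3.0]])
penalty = np.array([0.0, 2.0])
print(drift_plus_penalty_action(queues, arrivals, service, penalty, V=1.0))
print(drift_plus_penalty_action(queues, arrivals, service, penalty, V=100.0))
```

This per-slot greediness is exactly the myopia the paper contrasts with RL's long-horizon objective: the rule only looks one slot ahead, which LDPTRLQ aims to remedy by folding the stability constraint into the RL policy update.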