Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing

📅 2026-05-07

🤖 AI Summary
This work establishes finite-sample theoretical guarantees for Deep Q-Networks (DQN) under τ-mixing conditions, relaxing the common assumption that replay data are independent and identically distributed by accounting for the temporal dependence inherent in trajectory-based samples. The authors model each DQN update as a nonparametric regression problem with τ-mixing observations and derive explicit risk bounds and sample complexity that reflect the underlying dependence structure. Their analysis shows that temporal dependence degrades the statistical convergence rate by adding a dimensionality penalty to the rate exponent, which reflects the reduced effective sample size of τ-mixing data. Experiments on Gymnasium environments show that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, which supports the proposed theoretical framework.
📝 Abstract
Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-Network (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $τ$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $τ$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $τ$-mixing data. Moreover, we derive the sample complexity of DQN under $τ$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
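The claim about approximately exponentially decaying correlations can be made concrete with a toy example. The sketch below is not the paper's experiment: it assumes a scalar AR(1) process as a hypothetical stand-in for a temporally dependent trajectory, estimates its sample autocorrelation at a few lags, and fits an exponential decay rate. The names `simulate_ar1`, `autocorrelation`, and `fit_decay_rate` are illustrative only.

```python
import math
import random


def simulate_ar1(n, phi, seed=0):
    """Simulate x_t = phi * x_{t-1} + noise (assumed toy model of a dependent trajectory)."""
    rng = random.Random(seed)
    x = [0.0] * n
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.gauss(0.0, 1.0)
    return x


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


def fit_decay_rate(acf):
    """Fit acf[k-1] ~ exp(-rate * k) by least squares through the origin on the log scale."""
    pairs = [(k, math.log(r)) for k, r in enumerate(acf, start=1) if r > 0]
    num = sum(k * y for k, y in pairs)
    den = sum(k * k for k, _ in pairs)
    return -num / den


phi = 0.8  # assumed dependence strength; closer to 1 means slower decay
trajectory = simulate_ar1(20000, phi)
acf = [autocorrelation(trajectory, k) for k in range(1, 6)]
rate = fit_decay_rate(acf)

print("autocorrelation at lags 1..5:", [round(r, 3) for r in acf])
print("fitted decay rate:", round(rate, 3), "vs. -log(phi):", round(-math.log(phi), 3))
```

For an AR(1) process the autocorrelation at lag k is phi^k, so the fitted rate should land near -log(phi). Applying the same estimator to logged transitions from a replay buffer would be one way to check for dependence in practice, though which statistic the authors actually use is not stated here.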
Problem

Research questions and friction points this paper is trying to address.

Deep Q-Learning
Finite-sample analysis
Temporal dependence
τ-Mixing
Replay buffer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Q-learning
τ-mixing
finite-sample guarantees
temporal dependence
nonparametric regression