AI Summary
This work addresses the degradation of variance reduction guarantees in classical majority-vote ensembles when training data exhibit Markovian dependence, as commonly arises in time series or reinforcement learning replay buffers. Focusing on discrete classification, it establishes the first minimax risk lower bound for ensemble learning under a fixed-dimensional Markov chain, revealing a suboptimality gap of order $\sqrt{\Tmix}$ for uniform bagging. To bridge this gap, the authors propose an adaptive spectral routing algorithm that partitions data using the Fiedler eigenvector of the dependency graph, achieving a near-optimal risk rate of $\mathcal{O}(\sqrt{\Tmix/n})$ without requiring prior knowledge of the mixing time. The theoretical analysis integrates information-theoretic lower bounds, Markov ergodicity, and spectral properties of graph Laplacians, with empirical validation demonstrating superior performance on synthetic chains, spatial grids, UCR time-series datasets, and Atari DQN ensembles.
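As a quick check on the quoted gap, dividing the lower bound for uniform bagging stated in the abstract by the minimax rate gives the following (constants are suppressed, and $\Tmix$ denotes the mixing time as in the abstract):

\[
\frac{\Tmix/\sqrt{n}}{\sqrt{\Tmix/n}}
  = \frac{\Tmix}{\sqrt{n}} \cdot \frac{\sqrt{n}}{\sqrt{\Tmix}}
  = \sqrt{\Tmix}.
\]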
Abstract
Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\Omega(\sqrt{\Tmix/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\Omega(\Tmix/\sqrt{n})$, exhibiting a $\sqrt{\Tmix}$ algorithmic gap. Finally, we propose \emph{adaptive spectral routing}, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{\Tmix/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $\Tmix$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
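To make the partitioning step concrete, the following is a minimal sketch of a Fiedler-vector bisection, not the paper's implementation. The abstract does not say how the dependency graph is built, so `lag_graph` (weights decaying geometrically with temporal lag) and its parameters `decay` and `max_lag` are hypothetical choices made here for illustration; only the use of the second-smallest Laplacian eigenvector comes from the text.

```python
import numpy as np


def lag_graph(n, decay=0.5, max_lag=3):
    """Hypothetical dependency graph over n sequential samples.

    Assumption (not from the paper): samples i and j are linked with
    weight decay**|i - j| whenever 1 <= |i - j| <= max_lag.
    """
    w = np.zeros((n, n))
    for lag in range(1, max_lag + 1):
        idx = np.arange(n - lag)
        w[idx, idx + lag] = decay ** lag
        w[idx + lag, idx] = decay ** lag
    return w


def fiedler_partition(weights):
    """Bisect graph nodes by the sign of the Fiedler eigenvector.

    Returns a boolean mask (True = first block) and the algebraic
    connectivity, i.e. the second-smallest Laplacian eigenvalue.
    """
    w = np.asarray(weights, dtype=float)
    laplacian = np.diag(w.sum(axis=1)) - w  # unnormalized L = D - W
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues ascending
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0.0, eigvals[1]


# Split 12 sequential samples into two blocks with a weak cut between them.
mask, connectivity = fiedler_partition(lag_graph(12))
block_a = np.flatnonzero(mask)
block_b = np.flatnonzero(~mask)
```

Which block receives the `True` label depends on the arbitrary sign of the eigenvector returned by the solver, so downstream code should not rely on it. How the resulting blocks are routed to base learners, and how the graph is estimated from data, are left to the paper itself.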