🤖 AI Summary
To address inefficient cooperative exploration in multi-agent reinforcement learning (MARL), this paper proposes a randomized least-squares value iteration (RLSVI) algorithm based on state-aggregated representations. The method establishes a concurrent learning framework and provides the first theoretical proof that randomizing value functions significantly improves parallel exploration efficiency among agents. It derives polynomial worst-case regret bounds for both finite- and infinite-horizon settings, achieving the optimal per-agent regret decay rate of $\Theta(1/\sqrt{N})$ in the number of agents $N$. Moreover, the algorithm reduces space complexity by a factor of $K$ while incurring only a $\sqrt{K}$-factor increase in regret. Numerical experiments validate both its theoretical guarantees and empirical effectiveness.
📝 Abstract
Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions for a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents *concurrently* explore an environment. The theoretical results established in this work provide an affirmative answer to this question. We adapt the concurrent learning framework to *randomized least-squares value iteration* (RLSVI) with an *aggregated state representation*. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurrent learning. Our algorithm exhibits significantly lower space complexity than those of Russo (2019) and Agrawal et al. (2021): we reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$-factor increase in the worst-case regret bound. Additionally, we conduct numerical experiments that corroborate our theoretical findings.
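To make the idea concrete, the following is a minimal, self-contained sketch of single-agent RLSVI with state aggregation on a toy episodic chain MDP. It is an illustration of the general technique only, not the paper's algorithm: the environment, the aggregation map `agg`, and the hyperparameters `sigma` and `lam` are all illustrative assumptions. Exploration comes from the Gaussian perturbation added to each regularized least-squares estimate, whose scale shrinks as visit counts grow.

```python
import numpy as np

rng = np.random.default_rng(0)

H, S, A = 5, 6, 2                 # horizon, raw states, actions (assumed toy sizes)
M = 3                             # number of aggregated "meta-states"
agg = np.array([0, 0, 1, 1, 2, 2])  # aggregation map phi: raw state -> meta-state

def step(s, a):
    # Deterministic chain: action 1 moves right, action 0 moves left;
    # reward 1 only for reaching the rightmost state.
    s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == S - 1 else 0.0)

sigma, lam = 1.0, 1.0             # noise scale and ridge regularizer (assumed)
data = [[] for _ in range(H)]     # per-step transitions (meta-state, a, r, s2)

def rlsvi_plan():
    # Backward pass: for each timestep, fit a perturbed regularized
    # least-squares Q-estimate over aggregated state-action pairs.
    Q = np.zeros((H + 1, M, A))   # Q[H] = 0 is the terminal value
    for h in reversed(range(H)):
        for m in range(M):
            for a in range(A):
                obs = [(r, s2) for (mm, aa, r, s2) in data[h]
                       if mm == m and aa == a]
                n = len(obs)
                targets = [r + Q[h + 1, agg[s2]].max() for (r, s2) in obs]
                mean = sum(targets) / (n + lam)            # ridge-shrunk mean
                noise = rng.normal(0.0, sigma / np.sqrt(n + lam))
                Q[h, m, a] = mean + noise                  # randomized estimate
    return Q

returns = []
for ep in range(200):
    Q = rlsvi_plan()              # resample a randomized Q before each episode
    s, total = 0, 0.0
    for h in range(H):
        a = int(Q[h, agg[s]].argmax())   # act greedily w.r.t. the perturbed Q
        s2, r = step(s, a)
        data[h].append((agg[s], a, r, s2))
        s, total = s2, total + r
    returns.append(total)
```

The concurrent setting studied in the paper would run many such agents in parallel on the same environment and pool their data, which is what drives the $\Theta(1/\sqrt{N})$ per-agent regret decay; the sketch above keeps a single agent for brevity.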