🤖 AI Summary
This paper addresses the pervasive **time-delayed feedback problem** in multi-agent game learning, revealing its fundamental mechanism: delay disrupts the future-reward prediction that Optimistic Follow-the-Regularized-Leader (OFTRL) relies on, inducing an optimization mismatch and degrading OFTRL's performance. The authors propose **Weighted OFTRL (WOFTRL)**, which weights the next-reward prediction vector in OFTRL by a factor of *n*. They first prove that even a single-step delay worsens OFTRL's regret and convergence, and then show that an appropriately chosen optimistic weight cancels the delay: when the weight exceeds the delay, WOFTRL achieves constant *O*(1) regret in general-sum normal-form games and best-iterate convergence to Nash equilibria in poly-matrix zero-sum games. Experiments validate the effectiveness and robustness of the prediction-enhanced mechanism. The core contribution is a rigorous theoretical characterization of the delay-induced optimization mismatch and a structured-prediction scheme that restores delay-robust performance.
📝 Abstract
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into the algorithm, as in Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both regret and convergence. We propose weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times, capturing the intuition that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers the good performance of the delay-free setting: the regret is constant ($O(1)$-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. Our theoretical results are supported and strengthened by experiments.
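To make the mechanism described above concrete, the following is a minimal sketch of a single WOFTRL strategy update under delayed feedback. It assumes an entropy regularizer (so the FTRL argmax reduces to a softmax), a learning rate `eta`, and a prediction equal to the last *observable* reward vector scaled by the optimistic weight `n`; these modeling choices and all function names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def woftrl_step(rewards, delay, weight, eta):
    """One WOFTRL update under a feedback delay (illustrative sketch).

    rewards : list of reward vectors u_1, ..., u_t generated so far
    delay   : number of most recent rewards not yet observable
    weight  : optimistic weight n applied to the prediction vector
    eta     : learning rate

    With an entropy regularizer, the OFTRL argmax over the cumulative
    observed rewards plus the (weighted) prediction is a softmax.
    """
    # Under a delay d, only u_1, ..., u_{t-d} are observable.
    observed = rewards[:len(rewards) - delay] if delay > 0 else rewards
    if not observed:
        # No feedback yet: play the uniform strategy.
        dim = len(rewards[0])
        return np.ones(dim) / dim
    cumulative = np.sum(observed, axis=0)
    # Prediction = last observed reward, weighted n times (WOFTRL);
    # weight = 1 recovers vanilla OFTRL's optimistic term.
    prediction = weight * observed[-1]
    return softmax(eta * (cumulative + prediction))
```

Setting `weight` larger than `delay` mirrors the paper's condition that the optimistic weight must exceed the time delay for the delay to be canceled.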