Learning from Delayed Feedback in Games via Extra Prediction

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the pervasive **time-delayed feedback problem** in multi-agent game learning, revealing its core mechanism: the delay induces an optimization mismatch that degrades the performance of OFTRL. The authors propose **Weighted OFTRL (WOFTRL)**, which weights the next-reward prediction vector in OFTRL by a factor of *n*. They prove that even a single-step delay worsens OFTRL's regret and convergence, while an optimistic weight exceeding the delay cancels its effect: WOFTRL recovers constant *O*(1) regret in general-sum normal-form games and best-iterate convergence to Nash equilibria in poly-matrix zero-sum games. Experiments validate the effectiveness and robustness of the weighted-prediction mechanism. The core contribution is a rigorous theoretical characterization of the delay-induced optimization mismatch and a structured-prediction fix that makes convergence robust to delay.
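The weighted-prediction step described above can be sketched as an entropy-regularized FTRL (softmax) update in which the most recently observed reward vector serves as the optimistic prediction and is weighted by *n*. This is a minimal illustrative sketch, not the authors' code; the function name, signature, and step size `eta` are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def woftrl_strategy(cum_reward, last_observed, eta, n):
    """One WOFTRL step with an entropy regularizer: play the softmax of
    eta * (cumulative observed rewards + n * most recently observed reward).
    n = 1 recovers standard OFTRL; the paper's fix is to choose n larger
    than the feedback delay. (Illustrative sketch, not the paper's code.)"""
    return softmax(eta * (cum_reward + n * last_observed))
```

For example, with no accumulated rewards and a last observed reward vector of `[1.0, -1.0]`, a larger weight `n` pushes the mixed strategy harder toward the predicted best response.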

📝 Abstract
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents learn their strategies independently, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, a prediction of the future reward is incorporated into the algorithms, typically as in Optimistic Follow-the-Regularized-Leader (OFTRL). However, a time delay in observing past rewards hinders this prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both regret and convergence. This study proposes weighted OFTRL (WOFTRL), in which the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture the intuition that the optimistic weight cancels out the time delay. We prove that when the optimistic weight exceeds the time delay, WOFTRL recovers good performance: the regret is constant ($O(1)$ regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
Problem

Research questions and friction points this paper is trying to address.

Addresses time-delayed feedback in multi-agent game learning
Proves delayed rewards degrade optimistic algorithm performance
Introduces weighted prediction to compensate for delay effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted OFTRL with optimistic prediction
Optimistic weight cancels out time delay
Achieves constant regret and convergence
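As a toy illustration of the "optimistic weight cancels delay" idea, the following hypothetical simulation runs two such softmax learners on matching pennies, where each reward vector is observed `delay` steps late and the prediction is the last observed reward weighted by `weight`. The game, step size, and initialization are assumptions for illustration only, not the paper's experimental setup.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_delayed_game(T, delay, weight, eta=0.1):
    """Two weighted-optimistic softmax learners in matching pennies, where
    the reward from round t arrives only at round t + delay. Returns player
    1's final mixed strategy. (Illustrative toy setup, not the paper's.)"""
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])     # player 1's payoffs; player 2 gets -A^T x
    g1, g2 = np.array([0.3, 0.0]), np.zeros(2)   # small asymmetric start to break symmetry
    last1, last2 = np.zeros(2), np.zeros(2)      # most recently observed reward vectors
    pending1, pending2 = [], []                  # reward vectors awaiting observation
    for t in range(T):
        x = softmax(eta * (g1 + weight * last1))
        y = softmax(eta * (g2 + weight * last2))
        pending1.append(A @ y)
        pending2.append(-A.T @ x)
        if t >= delay:                           # the round-(t - delay) reward arrives now
            last1, last2 = pending1[t - delay], pending2[t - delay]
            g1 += last1
            g2 += last2
    return x
```

Comparing `weight=1` (plain optimism) against `weight > delay` under a one-step delay is a simple way to probe the paper's claim that a sufficiently large optimistic weight restores stable behavior.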