🤖 AI Summary
In large imperfect-information games, exploiting subrational opponents online under depth-limited search forces a fundamental trade-off among depth constraints, adaptability, and robustness.
Method: We propose a matrix-valued state-policy composition framework, the first to enable full-depth opponent behavioral modeling under depth-limited search. The approach integrates matrix-valued state representation, policy composition optimization, a depth-limited extension of Monte Carlo Counterfactual Regret Minimization (MCCFR), and online Bayesian opponent modeling, yielding a structurally simpler alternative to conventional value-function-based methods.
Contribution/Results: Evaluated on poker and Battleship, our method achieves more than a twofold utility gain over baselines against suboptimal opponents whose mistakes lie beyond the depth limit, while significantly improving average utility and safety against randomly generated opponents. It lifts the restrictive assumption, inherent in traditional approaches, that the opponent plays rationally beyond the depth limit, enabling robust adaptation to diverse opponent types within computational budget constraints.
📝 Abstract
We study the problem of adapting to a known sub-rational opponent during online play while remaining robust to rational opponents. We focus on large imperfect-information (zero-sum) games, whose size makes it impossible to inspect the whole game tree at once and necessitates the use of depth-limited search. However, all existing methods assume rational play beyond the depth limit, which allows them to adapt to only a very limited portion of the opponent's behaviour. We propose an algorithm, Adapting Beyond Depth-limit (ABD), that uses a strategy-portfolio approach, which we refer to as matrix-valued states, for depth-limited search. This allows the algorithm to fully utilise all information about the opponent model, making it the first robust-adaptation method able to do so in large imperfect-information games. As an additional benefit, the use of matrix-valued states makes the algorithm simpler than traditional methods based on optimal value functions. Our experimental results in poker and Battleship show that ABD yields more than a twofold increase in utility when facing opponents who make mistakes beyond the depth limit, and also delivers significant improvements in utility and safety against randomly generated opponents.
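The online Bayesian opponent-modeling ingredient can be illustrated with a minimal sketch: maintain a posterior over a small portfolio of candidate opponent policies and update it after each observed opponent action. All names and numbers here (the portfolio, the prior, the observed actions) are illustrative assumptions, not the paper's actual algorithm or data.

```python
import numpy as np

def bayesian_update(prior, likelihoods):
    """Posterior over opponent types given P(observed action | type)."""
    posterior = prior * likelihoods
    total = posterior.sum()
    if total == 0:  # observed action impossible under every candidate type
        return np.ones_like(prior) / len(prior)
    return posterior / total

# Hypothetical portfolio: each row is one candidate opponent policy
# over 3 abstract actions.
portfolio = np.array([
    [0.8, 0.1, 0.1],      # type 0: strongly prefers action 0
    [0.1, 0.8, 0.1],      # type 1: strongly prefers action 1
    [1/3, 1/3, 1/3],      # type 2: uniform (near-rational placeholder)
])

belief = np.ones(3) / 3   # uniform prior over the three types

# Observe a short sequence of opponent actions and update the belief.
for action in [1, 1, 0, 1]:
    belief = bayesian_update(belief, portfolio[:, action])

print(belief.argmax())    # type 1 is most likely after mostly seeing action 1
```

In a full method such as ABD, this belief would then weight which portfolio response to play at the depth limit; here it only demonstrates the posterior update itself.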