AI Summary
Existing language agents exhibit insufficient strategic reasoning in dynamic adversarial games, and opponent selection mechanisms in self-play lack systematic investigation. Method: We propose Step-level poliCy Optimization through Play-And-Learn (SCO-PAL), a multi-stage reinforcement learning framework grounded in self-play, featuring a dynamic feedback mechanism for fine-grained policy refinement. Contribution/Results: SCO-PAL introduces the first causal analysis of how opponent difficulty influences the evolution of strategic reasoning capabilities, and designs an adaptive opponent-matching strategy. Evaluated across six adversarial games, SCO-PAL improves the average win rate over baselines by approximately 30% and attains a 54.76% win rate against GPT-4, significantly outperforming conventional methods that rely on expert annotations or static feedback. This work establishes a scalable paradigm for strategic autonomous learning in language agents.
Abstract
Existing language agents often struggle in dynamic adversarial games due to poor strategic reasoning. A promising way to mitigate this limitation is to let agents learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments, where agents receive fixed feedback or rewards, dynamic adversarial games make the choice of opponent a significant factor in learning performance. However, opponent selection in adversarial environments remains underexplored. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Using SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.
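The abstract does not specify how opponents are matched, so the following is only a loose illustration (all names and the scalar "strength" model are hypothetical, not from the paper) of one way adaptive opponent matching can reduce to self-play: if a snapshot of the current agent sits in the opponent pool, picking the opponent closest in strength selects that snapshot.

```python
def pick_opponent(agent_strength, opponent_pool):
    """Hypothetical adaptive opponent matching.

    Returns the opponent whose strength is closest to the agent's own.
    When the pool contains a frozen snapshot of the agent itself, that
    snapshot is chosen, which is exactly the self-play setting the
    abstract reports as most effective.
    """
    return min(opponent_pool, key=lambda s: abs(s - agent_strength))


# Pool of fixed-level opponents plus a snapshot of the agent itself (1.3).
pool = [0.2, 0.8, 1.3, 2.5]
chosen = pick_opponent(1.3, pool)  # the self-play snapshot is selected
```

This is a sketch under the assumption that opponent difficulty can be summarized by a single scalar; in practice, difficulty would be estimated from observed win rates rather than given directly.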