Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

📅 2026-04-19

🤖 AI Summary
Existing self-play methods rely solely on final game outcomes, making it difficult to distinguish between transferable reasoning patterns and task-specific heuristics, thereby limiting cross-domain generalization. This work proposes a trajectory-modulated self-play framework that identifies abstract reasoning trajectories through a learnable transferability coefficient and incorporates a reasoning evolution reward mechanism to foster adaptive reasoning development. By integrating trajectory-level reinforcement learning with dynamic context generation, the approach overcomes the limitations of domain specificity and static contextual representations. It achieves significant performance gains across benchmarks in mathematical reasoning, general-purpose reasoning, and code generation, with particularly notable advances on competition-level mathematical tasks. Ablation studies and human evaluations confirm the effectiveness of the proposed method.

📝 Abstract
Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
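
The abstract does not give formulas for the two components, so the following is only a minimal sketch of how a transferability coefficient and an evolution reward could modulate a terminal game outcome. The names `Trajectory`, `modulated_reward`, and `evolution_weight`, the multiplicative-plus-additive form, and all numeric values are assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    outcome: float          # terminal game result, e.g. +1.0 win, -1.0 loss
    transferability: float  # assumed coefficient in [0, 1]; learnable in the paper, fixed here
    evolution: float        # assumed bonus for reasoning that adapts during play


def modulated_reward(traj: Trajectory, evolution_weight: float = 0.5) -> float:
    """Hypothetical combination: scale the outcome by the transferability
    coefficient, then add the weighted evolution bonus."""
    return traj.transferability * traj.outcome + evolution_weight * traj.evolution


# Two winning trajectories: one reasons abstractly, one leans on game-specific heuristics.
abstract_win = Trajectory(outcome=1.0, transferability=0.9, evolution=0.4)
heuristic_win = Trajectory(outcome=1.0, transferability=0.2, evolution=0.0)

# Same terminal outcome, but the abstract trajectory gets the larger training signal.
print(modulated_reward(abstract_win) > modulated_reward(heuristic_win))
```

Under these assumptions, an outcome-only reward would treat both wins identically, whereas the modulated reward separates them, which is the distinction the abstract says terminal outcomes alone cannot make.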

Problem

Research questions and friction points this paper is trying to address.

reasoning transfer
domain specificity
contextual stasis
self-play
language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning transferability
trajectory-modulated self-play
domain-agnostic reasoning
reasoning evolution
language model reasoning