Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

📅 2026-04-19

🤖 AI Summary

Existing self-play methods rely solely on final game outcomes, making it difficult to distinguish between transferable reasoning patterns and task-specific heuristics, thereby limiting cross-domain generalization. This work proposes a trajectory-modulated self-play framework that identifies abstract reasoning trajectories through a learnable transferability coefficient and incorporates a reasoning evolution reward mechanism to foster adaptive reasoning development. By integrating trajectory-level reinforcement learning with dynamic context generation, the approach overcomes the limitations of domain specificity and static contextual representations. It achieves significant performance gains across benchmarks in mathematical reasoning, general-purpose reasoning, and code generation, with particularly notable advances on competition-level mathematical tasks. Ablation studies and human evaluations confirm the effectiveness of the proposed method.

📝 Abstract

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
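
The abstract does not give formulas, so the following is only a minimal sketch of what trajectory-level modulation could look like, not the authors' implementation. It assumes each self-play trajectory carries a terminal outcome, a transferability coefficient in [0, 1], and an evolution bonus. The names `Trajectory`, `modulated_returns`, and `evo_weight`, the additive reward shaping, and the mean baseline are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    outcome: float          # terminal game result, e.g. +1 for a win, -1 for a loss
    transferability: float  # hypothetical coefficient in [0, 1]; higher = more abstract reasoning
    evolution: float        # hypothetical bonus for reasoning that adapts over the game


def modulated_returns(trajs: List[Trajectory], evo_weight: float = 0.5) -> List[float]:
    """Illustrative only: weight each trajectory's learning signal by its coefficient.

    Assumption: the shaped return is outcome + evo_weight * evolution, centered
    with a mean baseline, then scaled by the transferability coefficient.
    """
    shaped = [t.outcome + evo_weight * t.evolution for t in trajs]
    baseline = sum(shaped) / len(shaped)
    return [t.transferability * (s - baseline) for t, s in zip(trajs, shaped)]


if __name__ == "__main__":
    batch = [
        Trajectory(outcome=1.0, transferability=0.9, evolution=0.2),   # win, abstract reasoning
        Trajectory(outcome=1.0, transferability=0.1, evolution=0.0),   # win, game-specific heuristic
        Trajectory(outcome=-1.0, transferability=0.5, evolution=0.0),  # loss
    ]
    for traj, signal in zip(batch, modulated_returns(batch)):
        print(traj, round(signal, 3))
```

Under these assumptions, two trajectories that both win receive different learning signals: the one with the higher coefficient is reinforced more strongly, which is the distinction that an outcome-only reward cannot make.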

Problem

Research questions and friction points this paper is trying to address.

reasoning transfer

domain specificity

contextual stasis

self-play

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning transferability

trajectory-modulated self-play

domain-agnostic reasoning