🤖 AI Summary
Directly adapting large language models (LLMs) to multi-agent systems (MAS) is challenging due to complex reward modeling, highly dynamic agent interactions, and stringent generalization requirements. Method: This paper proposes a domain-aligned post-training paradigm that uses economics as a structured testbed; it combines supervised fine-tuning (SFT) with reinforcement learning from verifiable rewards (RLVR) on a high-quality, self-constructed dataset of 2,100 economic reasoning problems to train a 7B open-weight LLM. Contribution/Results: This work introduces RLVR to economic reasoning for the first time, significantly improving the model's equilibrium prediction accuracy, strategic consistency, and economic rationality in unseen multi-agent games. It empirically demonstrates that post-training can effectively induce cross-task strategic generalization and align agents with rational behavioral patterns.
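To make "verifiable rewards" concrete, here is a minimal sketch of the kind of reward function RLVR relies on. The answer format (`Answer:` marker) and the exact-match check are illustrative assumptions, not the paper's actual implementation: the key idea is simply that each economics problem has a machine-checkable reference answer (e.g. an equilibrium price), so the reward needs no learned reward model.

```python
def extract_answer(completion: str) -> str:
    """Pull the final answer out of a model completion (hypothetical 'Answer:' format)."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    return completion[idx + len(marker):].strip() if idx != -1 else ""

def verifiable_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted answer matches the reference."""
    return 1.0 if extract_answer(completion) == gold.strip() else 0.0

# Example: reward a completion for a Bertrand-competition equilibrium question.
r = verifiable_reward(
    "Both firms undercut until price equals marginal cost. Answer: p = c",
    "p = c",
)
print(r)  # → 1.0
```

Because the reward is computed by direct checking rather than by a learned model, it sidesteps the reward-modeling difficulty the summary identifies for multi-agent settings, at the cost of requiring problems with unambiguous ground-truth answers.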
📄 Abstract
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively *generalize* to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce **Recon** (**R**easoning like an **ECON**omist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.