🤖 AI Summary
This work investigates how large language models' (LLMs) reasoning capabilities affect negotiation performance in multilingual settings, alongside the associated computational cost trade-offs. We employ a self-play framework to systematically evaluate both open-source and commercial LLMs on three dialogue-based negotiation tasks, conducted in German, Italian, and English, incorporating test-time reasoning scaling strategies. Our key contribution is the first empirical identification of pervasive "language fallback to English" during internal reasoning in open-source models, whereas commercial models keep their reasoning and output languages aligned, highlighting a critical gap in interpretability and native-language reasoning fidelity. Experiments show that enabling reasoning boosts GPT-5's negotiation performance by 31.4%, albeit at nearly a 400% increase in computational cost. Reasoning significantly improves collaborative behavior and robustness to complex strategies, yet also exposes language inconsistency across multilingual reasoning paths.
📝 Abstract
Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (potentially undermining the explainability gains that disclosed reasoning traces could offer), while leading commercial models maintain language consistency between their reasoning and final output.
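To make the self-play setup concrete, here is a minimal sketch of an alternating-offers negotiation loop between two agents. This is an illustrative toy (rule-based agents splitting a fixed pot, with a one-point concession per rejected round), not the paper's actual LLM-driven implementation; the agent parameters `demand` and `reserve` and the game rules are assumptions for the example.

```python
def negotiate(agent_a, agent_b, pot=10, max_rounds=6):
    """Alternating-offers self-play: the proposer claims a share of `pot`;
    the responder accepts if its remaining share meets its reservation
    value, otherwise it concedes nothing, roles swap, and play continues."""
    proposer, responder = agent_a, agent_b
    for round_no in range(1, max_rounds + 1):
        offer = proposer["demand"]            # share the proposer keeps
        responder_share = pot - offer
        if responder_share >= responder["reserve"]:
            # Deal struck: report shares from agent A's perspective.
            a_share = offer if proposer is agent_a else responder_share
            return {"agreed": True, "rounds": round_no,
                    "a_share": a_share, "b_share": pot - a_share}
        # Rejected: proposer concedes one point (never below its reserve),
        # then the roles swap for the next round.
        proposer["demand"] = max(proposer["demand"] - 1, proposer["reserve"])
        proposer, responder = responder, proposer
    return {"agreed": False, "rounds": max_rounds,
            "a_share": 0, "b_share": 0}

# Two agents with illustrative opening demands and reservation values.
a = {"demand": 8, "reserve": 4}
b = {"demand": 7, "reserve": 4}
result = negotiate(a, b)
```

In the paper's framework, the rule-based `proposer["demand"]` step would be replaced by an LLM generating a dialogue turn (with or without a reasoning phase), and the loop's turn count and token usage would feed the performance-versus-cost analysis.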