Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluation overlooks implicit behavioral shifts during model upgrades, risking deployment of models with degraded social reasoning robustness. Method: Grounded in the trust game framework from behavioral economics, this study employs multi-round interactive prompting and quantitative behavioral modeling to systematically compare strategic decision-making in trust scenarios between DeepSeek and OpenAI's o1-mini/o3-mini models. Contribution/Results: We identify, for the first time, a "trust behavior collapse" in the o1-mini and o3-mini models, characterized by abrupt, inconsistent defection, while DeepSeek exhibits forward-planning and theory-of-mind-guided cooperative stability. Empirically, DeepSeek achieves a 37% higher average payoff and 2.1× greater strategy stability than the o-mini models. These findings challenge the sufficiency of conventional performance benchmarks and establish a behavioral assessment paradigm for high-stakes AI deployment, emphasizing robustness in higher-order social reasoning over narrow task accuracy.
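The summary's payoff and stability figures suggest simple behavioral statistics computed over logged rounds. Below is a minimal sketch of such quantitative behavioral modeling; the metric definitions (mean payoff per round, and stability as one minus the fraction of round-to-round strategy switches) are illustrative assumptions, not the paper's reported formulas.

```python
# Hypothetical sketch of behavioral metrics over logged trust-game rounds.
# The definitions below are assumptions for illustration; the paper's exact
# formulas for payoff and "strategy stability" are not reproduced on this page.

from statistics import mean

def average_payoff(payoffs: list[float]) -> float:
    """Mean payoff per round across an interaction sequence."""
    return mean(payoffs)

def strategy_stability(decisions: list[str]) -> float:
    """1.0 when the model never switches strategy between consecutive rounds,
    lower the more often it flips (e.g. cooperate -> defect)."""
    if len(decisions) < 2:
        return 1.0
    switches = sum(a != b for a, b in zip(decisions, decisions[1:]))
    return 1.0 - switches / (len(decisions) - 1)

# Example usage with an illustrative decision log for one model.
log = ["cooperate", "cooperate", "defect", "cooperate", "defect"]
print(average_payoff([12, 14, 6, 11, 5]))   # 9.6
print(strategy_stability(log))              # 0.25
```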

📝 Abstract
When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.
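For context on the abstract's game-theoretic setup, the canonical behavioral-economics trust game has a simple payoff structure: the investor sends part of an endowment, the amount is multiplied, and the trustee decides how much to return. The sketch below uses a conventional endowment of 10 and a ×3 multiplier; these values are assumptions for illustration, since the paper's exact parameters are not reproduced on this page.

```python
# Minimal sketch of a single trust-game round (standard formulation).
# Endowment of 10 and the x3 multiplier are conventional assumed values.

def trust_game_round(endowment: float, sent: float, returned: float,
                     multiplier: float = 3.0) -> tuple[float, float]:
    """Compute payoffs for the investor (trustor) and the trustee."""
    assert 0 <= sent <= endowment, "investor can send at most the endowment"
    pot = sent * multiplier                     # amount received by the trustee
    assert 0 <= returned <= pot, "trustee can return at most the multiplied amount"
    investor_payoff = endowment - sent + returned
    trustee_payoff = pot - returned
    return investor_payoff, trustee_payoff

# Example: full trust met with a 50% return of the multiplied amount.
print(trust_game_round(endowment=10, sent=10, returned=15))  # (15.0, 15.0)
```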
Problem

Research questions and friction points this paper is trying to address.

Assessing trust behavior in LLMs
Comparing trusting behavior across DeepSeek and OpenAI models
Identifying hidden fault lines that narrow LLM benchmarks miss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Game-theoretic behavioral economics model of trust
Contrasting trust behavior analysis via multi-round interactive prompting (sketched after this list)
Forward planning and theory-of-mind integration
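A rough sketch of the multi-round interactive prompting referenced above is shown here. `query_model` is a hypothetical stand-in for whichever client calls DeepSeek or an OpenAI model, and the prompt wording, investor policy, and response parsing are illustrative assumptions rather than the paper's actual protocol.

```python
# Illustrative multi-round trust-game harness with an LLM as trustee.
# `query_model` is a hypothetical stand-in for a real chat-completion client;
# prompt wording and parsing are assumptions, not the paper's protocol.

from typing import Callable

def play_trust_rounds(query_model: Callable[[str], str],
                      rounds: int = 5,
                      endowment: float = 10.0,
                      multiplier: float = 3.0) -> list[float]:
    """Run repeated trust-game rounds; return the amount the model gives back each round."""
    history: list[str] = []
    returns: list[float] = []
    for r in range(1, rounds + 1):
        sent = endowment                      # assume the investor sends the full endowment
        pot = sent * multiplier
        prompt = (
            f"Round {r} of {rounds} of a repeated trust game.\n"
            f"The investor sent {sent}, which was tripled to {pot}.\n"
            f"Previous rounds: {history or 'none'}.\n"
            f"How much do you return to the investor? Answer with a number only."
        )
        reply = query_model(prompt)
        returned = min(max(float(reply.strip()), 0.0), pot)  # clamp to a valid amount
        history.append(f"round {r}: received {pot}, returned {returned}")
        returns.append(returned)
    return returns

# Example with a toy stand-in model that always reciprocates half of the pot.
print(play_trust_rounds(lambda prompt: "15", rounds=3))  # [15.0, 15.0, 15.0]
```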