Evaluating Binary Decision Biases in Large Language Models: Implications for Fair Agent-Based Financial Simulations

📅 2025-01-20
🤖 AI Summary
This study systematically investigates response biases in large language models (LLMs) when simulating binary financial market decisions, and the detrimental impact of those biases on simulation fairness and accuracy. Methodologically, it compares response distributions across GPT variants (including GPT-4o-Mini-2024-07-18 and GPT-4-0125-preview) under varying conditions: one-shot versus few-shot prompting, different temperature settings, and batch versus repeated single API calls; negative-recency tests and true-random binary sequences serve as baselines. Key contributions: (1) identifying inter-version "Yes"/"No" response disparities exceeding 60 percentage points; (2) demonstrating that GPT-4o-Mini achieves comparatively balanced "Yes" rates of 32–43%, against GPT-4-0125-preview's extreme bias (98–99% "Yes"); (3) showing that few-shot prompting can approach a 50% equilibrium distribution, whereas no current model satisfies both uniformity and Markovianity in one-shot testing; and (4) revealing that some models surpass humans on negative-recency tasks. Together these results provide a bias diagnostic framework and a path toward controllable generation for LLM-based financial agent modeling.
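The bias diagnostics described above can be sketched in miniature. The snippet below computes the "Yes" rate and an empirical transition table for a sequence of binary answers; `random.choice` stands in for the GPT API calls used in the paper, so the numbers only illustrate the true-random baseline, not any model's behavior.

```python
import random
from collections import Counter

def yes_rate(responses):
    """Fraction of 'Yes' answers in a sequence of binary responses."""
    return sum(r == "Yes" for r in responses) / len(responses)

def transition_probs(responses):
    """Empirical P(next answer | previous answer) — a crude Markov check.

    A memoryless generator should give rows close to the marginal
    Yes-rate regardless of the previous answer.
    """
    counts = Counter(zip(responses, responses[1:]))
    probs = {}
    for prev in ("Yes", "No"):
        total = sum(counts[(prev, nxt)] for nxt in ("Yes", "No"))
        if total:
            probs[prev] = {nxt: counts[(prev, nxt)] / total
                           for nxt in ("Yes", "No")}
    return probs

# Stand-in for model output: a true-random baseline (the paper queries
# GPT variants via the API; random.choice replaces those calls here).
random.seed(0)
baseline = [random.choice(["Yes", "No"]) for _ in range(10_000)]
print(round(yes_rate(baseline), 3))   # should sit near 0.5
print(transition_probs(baseline))
```

Running the same two statistics over model outputs instead of `baseline` would surface both the 60-point inter-version disparity and departures from Markovian behavior.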

📝 Abstract
Large Language Models (LLMs) are increasingly being used to simulate human-like decision making in agent-based financial market models (ABMs). As models become more powerful and accessible, researchers can now incorporate individual LLM decisions into ABM environments. However, integration may introduce inherent biases that need careful evaluation. In this paper, we test three state-of-the-art GPT models for bias using two model sampling approaches: one-shot and few-shot API queries. We observe significant variations in output distributions between specific models and model sub-versions, with GPT-4o-Mini-2024-07-18 showing notably better performance (32-43% yes responses) compared to GPT-4-0125-preview's extreme bias (98-99% yes responses). We show that sampling methods and model sub-versions significantly impact results: repeated independent API calls produce different distributions compared to batch sampling within a single call. While no current GPT model can simultaneously achieve a uniform distribution and Markovian properties in one-shot testing, few-shot sampling can approach uniform distributions under certain conditions. We explore the Temperature parameter, providing a definition and comparative results. We further compare our results to true random binary series and test specifically for the common human bias of Negative Recency, finding LLMs have a mixed ability to 'beat' humans in this one regard. These findings emphasise the critical importance of careful LLM integration into ABMs for financial markets and more broadly.
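The Temperature parameter the abstract explores is conventionally defined as a divisor applied to the logits before the softmax. A minimal sketch of that definition follows; the Yes/No logits are invented for illustration, not values taken from any GPT model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax: p_i ∝ exp(logit_i / T).

    T → 0 approaches argmax (near-deterministic choice); large T
    flattens the distribution toward uniform.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

yes_no_logits = [2.0, 1.0]                # hypothetical Yes vs No logits
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(yes_no_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under this definition, lowering temperature exaggerates whatever Yes/No preference the model's logits encode, which is why the paper reports temperature-dependent comparative results rather than a single bias figure per model.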
Problem

Research questions and friction points this paper is trying to address.

Language Models
Financial Market Decisions
Bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Financial Decision Bias
Temperature Parameter Influence
Randomness and Bias Assessment
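A negative-recency check of the kind the paper runs against true-random series can be sketched as follows. The `run_len` threshold and the use of `random.choice` in place of model responses are assumptions for illustration: a source with human-like negative recency (gambler's-fallacy behavior) would answer "Yes" less often after a run of "Yes" answers, while a memoryless source stays at its marginal rate.

```python
import random

def repeat_rate_after_run(seq, run_len=3):
    """P(answer == 'Yes' | previous run_len answers were all 'Yes')."""
    hits = total = 0
    for i in range(run_len, len(seq)):
        if all(s == "Yes" for s in seq[i - run_len:i]):
            total += 1
            hits += seq[i] == "Yes"
    return hits / total if total else float("nan")

# True-random baseline: no recency effect, so the rate stays near 0.5.
random.seed(1)
true_random = [random.choice(["Yes", "No"]) for _ in range(50_000)]
print(round(repeat_rate_after_run(true_random), 3))
```

Substituting model outputs for `true_random` and comparing the resulting rate against both 0.5 and typical human rates is one way to test whether an LLM 'beats' humans on this bias.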