🤖 AI Summary
This study investigates whether general-purpose artificial intelligence (GPAI) systems exhibit data-induced irrational judgments in software engineering, i.e., whether the human cognitive biases embedded in their training data carry over into their reasoning.
Method: The authors introduce the first dynamic cognitive-bias evaluation framework tailored to software engineering. It integrates Prolog-based formal reasoning, LLM-as-a-judge validation, and a seed-task-driven GPAI self-generation pipeline, enabling controllable bias injection, high task diversity, and adjustable logical reasoning complexity.
Contribution/Results: Experiments reveal pervasive cognitive biases across mainstream GPAI systems (5.9%–35% bias rates), with bias incidence escalating sharply as task logical complexity increases (up to 49%). These findings expose critical reliability risks for GPAI in real-world development scenarios. The framework establishes a scalable, reproducible methodology for systematic bias assessment in AI systems, advancing rigorous evaluation of reasoning fidelity in software engineering contexts.
📝 Abstract
Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases?
To investigate this, we present the first dynamic benchmarking framework for evaluating data-induced cognitive biases in GPAI within software engineering workflows. Starting from a seed set of 16 hand-crafted, realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) alongside a corresponding unbiased variant, we test whether bias-inducing linguistic cues unrelated to the task logic can push GPAI systems from correct to incorrect conclusions.
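To make the setup concrete, here is a hypothetical task pair of the kind the seed set could contain (this example is ours, not drawn from the paper's 16 seed tasks): both variants share the same logic and the same answer, and the biased variant only adds an anchoring cue that a rational solver should ignore.

```python
# Hypothetical anchoring task pair (illustrative only, not from the benchmark).
# The "12 cores" estimate in the biased variant is logically irrelevant:
# 120 req/s * 0.025 s = 3 core-seconds of work per second, and keeping
# utilization <= 75% requires ceil(3 / 0.75) = 4 cores in both variants.
anchoring_pair = {
    "unbiased": (
        "A service receives 120 requests/s and each request needs 25 ms of "
        "CPU time on one core. How many cores keep CPU utilization at or "
        "below 75%?"
    ),
    "biased": (
        "A senior engineer estimates we will need about 12 cores. A service "
        "receives 120 requests/s and each request needs 25 ms of CPU time on "
        "one core. How many cores keep CPU utilization at or below 75%?"
    ),
    "answer": 4,  # identical for both variants; the anchor must not change it
}
```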
To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline that relies on GPAI systems to generate task variants preserving the bias-inducing cues while varying surface details. The pipeline achieves high correctness (88–99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning and LLM-as-a-judge validation. It also verifies that the embedded biases are both harmful and undetectable by logic-based, unbiased reasoners.
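A minimal sketch of how such a generate-and-filter pipeline could be wired together follows; all names (`Task`, `generate`, `logic_check`, `judge`) are hypothetical placeholders, since the abstract names the components but not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    text: str    # task statement, including the bias-inducing cue
    answer: str  # ground-truth conclusion

# Sketch under stated assumptions: the two filters mirror the components the
# abstract names (Prolog-based logic check, LLM-as-a-judge), but none of
# these signatures come from the paper.
def augment(seed: Task,
            n_variants: int,
            generate: Callable[[Task], Task],     # GPAI rewrites surface details
            logic_check: Callable[[Task], bool],  # Prolog: unbiased reasoner still solves it
            judge: Callable[[Task, Task], bool]   # LLM judge: cue and logic preserved
            ) -> list[Task]:
    variants: list[Task] = []
    while len(variants) < n_variants:
        candidate = generate(seed)
        # The cue must be logically inert: a formal, wording-blind reasoner
        # must still derive the correct answer from the candidate.
        if not logic_check(candidate):
            continue
        # The judge must confirm the bias cue survived the rewrite and that
        # only surface details changed relative to the seed.
        if not judge(seed, candidate):
            continue
        variants.append(candidate)
    return variants
```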
We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over deep reasoning. All systems exhibit cognitive biases, with bias rates ranging from 5.9% to 35% across bias types, and bias sensitivity increases sharply with task complexity (up to 49%), highlighting critical risks in real-world software engineering deployments.
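The abstract does not spell out the scoring rule behind these percentages; one plausible reading, sketched below as an assumption rather than the paper's definition, is a flip rate: the fraction of task pairs a system solves without the cue but gets wrong once the cue is added.

```python
# Assumed metric (not confirmed by the paper): a pair counts toward the bias
# rate when the system is correct on the unbiased variant but wrong on the
# otherwise-identical biased variant.
def bias_rate(results: list[tuple[bool, bool]]) -> float:
    """results: (correct_on_unbiased, correct_on_biased), one entry per pair."""
    eligible = [r for r in results if r[0]]       # solved without the bias cue
    flipped = [r for r in eligible if not r[1]]   # failed once the cue appeared
    return len(flipped) / len(eligible) if eligible else 0.0

# Example: 3 of 4 pairs are solved cleanly, and the cue flips one of those.
print(bias_rate([(True, True), (True, False), (True, True), (False, False)]))  # ~0.33
```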