🤖 AI Summary
This study investigates whether general-purpose artificial intelligence (GPAI) systems exhibit data-induced irrational judgments in software engineering, i.e., whether the human cognitive biases embedded in their training data carry over into their reasoning.
Method: The authors introduce the first dynamic cognitive-bias evaluation framework tailored to software engineering. It integrates Prolog-based formal reasoning, LLM-as-a-judge validation, and a seed-task-driven GPAI self-generation pipeline, enabling controllable bias injection, high task diversity, and adjustable logical reasoning complexity.
Contribution/Results: Experiments reveal pervasive cognitive biases across mainstream GPAI systems (5.9%–35% bias rates), with bias incidence escalating sharply as task logical complexity increases (up to 49%). These findings expose critical reliability risks for GPAI in real-world development scenarios. The framework establishes a scalable, reproducible methodology for systematic bias assessment in AI systems, advancing rigorous evaluation of reasoning fidelity in software engineering contexts.
📝 Abstract
Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases?
To investigate this, we present the first dynamic benchmarking framework for evaluating data-induced cognitive biases in GPAI within software engineering workflows. Starting from a seed set of 16 hand-crafted, realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) alongside a corresponding unbiased variant, we test whether bias-inducing linguistic cues unrelated to the task logic can push GPAI systems from correct to incorrect conclusions.
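To make the setup concrete, here is a hypothetical task pair of the kind the seed set could contain (this example is ours, not drawn from the paper's 16 seed tasks): both variants share the same logic and the same answer, and the biased variant only adds an anchoring cue that a rational solver should ignore.

```python
# Hypothetical anchoring task pair (illustrative only, not from the benchmark).
# The "12 cores" estimate in the biased variant is logically irrelevant:
# 120 req/s * 0.025 s = 3 core-seconds of work per second, and keeping
# utilization <= 75% requires ceil(3 / 0.75) = 4 cores in both variants.
anchoring_pair = {
    "unbiased": (
        "A service receives 120 requests/s and each request needs 25 ms of "
        "CPU time on one core. How many cores keep CPU utilization at or "
        "below 75%?"
    ),
    "biased": (
        "A senior engineer estimates we will need about 12 cores. A service "
        "receives 120 requests/s and each request needs 25 ms of CPU time on "
        "one core. How many cores keep CPU utilization at or below 75%?"
    ),
    "answer": 4,  # identical for both variants; the anchor must not change it
}
```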
To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline that relies on GPAI systems to generate task variants preserving the bias-inducing cues while varying surface details. The pipeline achieves high correctness (88–99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning and LLM-as-a-judge validation. It also verifies that the embedded biases are both harmful and undetectable by logic-based, unbiased reasoners.
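A minimal sketch of how such a generate-and-filter pipeline could be wired together follows; all names (`Task`, `generate`, `logic_check`, `judge`) are hypothetical placeholders, since the abstract names the components but not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    text: str    # task statement, including the bias-inducing cue
    answer: str  # ground-truth conclusion

# Sketch under stated assumptions: the two filters mirror the components the
# abstract names (Prolog-based logic check, LLM-as-a-judge), but none of
# these signatures come from the paper.
def augment(seed: Task,
            n_variants: int,
            generate: Callable[[Task], Task],     # GPAI rewrites surface details
            logic_check: Callable[[Task], bool],  # Prolog: unbiased reasoner still solves it
            judge: Callable[[Task, Task], bool]   # LLM judge: cue and logic preserved
            ) -> list[Task]:
    variants: list[Task] = []
    while len(variants) < n_variants:
        candidate = generate(seed)
        # The cue must be logically inert: a formal, wording-blind reasoner
        # must still derive the correct answer from the candidate.
        if not logic_check(candidate):
            continue
        # The judge must confirm the bias cue survived the rewrite and that
        # only surface details changed relative to the seed.
        if not judge(seed, candidate):
            continue
        variants.append(candidate)
    return variants
```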
We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over deep reasoning. All systems exhibit cognitive biases, with bias rates ranging from 5.9% to 35% across bias types, and bias sensitivity increases sharply with task complexity (up to 49%), highlighting critical risks in real-world software engineering deployments.
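The abstract does not spell out the scoring rule behind these percentages; one plausible reading, sketched below as an assumption rather than the paper's definition, is a flip rate: the fraction of task pairs a system solves without the cue but gets wrong once the cue is added.

```python
# Assumed metric (not confirmed by the paper): a pair counts toward the bias
# rate when the system is correct on the unbiased variant but wrong on the
# otherwise-identical biased variant.
def bias_rate(results: list[tuple[bool, bool]]) -> float:
    """results: (correct_on_unbiased, correct_on_biased), one entry per pair."""
    eligible = [r for r in results if r[0]]       # solved without the bias cue
    flipped = [r for r in eligible if not r[1]]   # failed once the cue appeared
    return len(flipped) / len(eligible) if eligible else 0.0

# Example: 3 of 4 pairs are solved cleanly, and the cue flips one of those.
print(bias_rate([(True, True), (True, False), (True, True), (False, False)]))  # ~0.33
```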