LLMs Are Not a Silver Bullet: A Case Study on Software Fairness

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models (LLMs) outperform traditional machine learning (ML) methods on software fairness tasks. To address the lack of empirical guidance for choosing between the two in practice, we present a large-scale comparison of mainstream LLM approaches (in-context learning and supervised fine-tuning) against conventional bias mitigation techniques, jointly evaluating fairness and predictive performance across multiple real-world datasets. Traditional ML methods consistently outperform existing LLM-based methods on both fairness and predictive performance, and supervised fine-tuning on the full training set yields competitive results but only limited advantages over traditional ML. We further show that the favorable results reported in prior LLM-based studies are largely driven by artificially balanced test sets rather than realistic imbalanced distributions, which indicates that LLMs are not a universal solution for software fairness.

📝 Abstract

Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.
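
The point about balanced test data can be illustrated with a toy calculation. The sketch below is not the paper's setup: the metric (statistical parity difference), the group and label counts, and the helper names are all assumptions chosen for illustration, since the abstract does not specify them. It shows one way that resampling a test set to equal (group, label) cells can make a disparity disappear even though the classifier is unchanged.

```python
from collections import Counter

def statistical_parity_difference(groups, predictions):
    """Absolute gap in positive-prediction rate between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        preds = [p for grp, p in zip(groups, predictions) if grp == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])

def make_rows(cell_counts):
    """Expand {(group, label): count} into a list of (group, label) rows."""
    return [(g, y) for (g, y), n in sorted(cell_counts.items()) for _ in range(n)]

def balance(rows):
    """Downsample so every (group, label) cell keeps the same number of rows."""
    smallest = min(Counter(rows).values())
    kept, seen = [], Counter()
    for row in rows:
        if seen[row] < smallest:
            kept.append(row)
            seen[row] += 1
    return kept

def predict(row):
    # Hypothetical classifier: it simply reproduces the true label.
    return row[1]

def spd_on(rows):
    groups = [g for g, _ in rows]
    preds = [predict(r) for r in rows]
    return statistical_parity_difference(groups, preds)

# Assumed "realistic" test set: group 0 is 60% positive, group 1 is 20% positive.
imbalanced = make_rows({(0, 1): 60, (0, 0): 40, (1, 1): 20, (1, 0): 80})
balanced = balance(imbalanced)

spd_imbalanced = spd_on(imbalanced)
spd_balanced = spd_on(balanced)
print(f"imbalanced: {spd_imbalanced:.2f}, balanced: {spd_balanced:.2f}")
```

In this toy case the same predictor shows a parity gap of about 0.40 on the imbalanced data and 0.00 after balancing, because balancing forces both groups to the same base rate. Other fairness metrics may respond differently, so this is only a sketch of why the evaluation distribution matters.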

Problem

Research questions and friction points this paper is trying to address.

fairness

bias mitigation

Large Language Models

machine learning

software fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Software Fairness

Bias Mitigation