Correlated Errors in Large Language Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates cross-model correlations in the error patterns of large language models (LLMs) and their implications for downstream tasks. Method: Through a large-scale empirical analysis of 350+ models on two mainstream leaderboard benchmarks and a realistic resume-screening task, the authors quantify cross-model error consistency, identify factors driving error correlation, and run scenario-based experiments framed by the concept of "algorithmic monoculture." Contribution/Results: They discover, for the first time, substantial error consistency among high-accuracy LLMs (on one leaderboard dataset, models agree up to 60% of the time when both err), persisting even across distinct architectures and vendors, challenging the common assumption that model diversity inherently mitigates homogeneity risks. This monoculture amplifies biases in LLM-as-judge evaluations and exacerbates hiring discrimination. The findings provide critical empirical evidence for robustness assessment and responsible deployment of LLMs, urging a reevaluation of diversity-based risk-mitigation strategies in practice.

📝 Abstract
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
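The headline statistic above is a conditional agreement rate: among instances where two models both answer incorrectly, how often do they give the same wrong answer? As a minimal sketch of how such a metric could be computed (the function name and toy data below are illustrative, not taken from the paper):

```python
import numpy as np

def error_agreement(preds_a, preds_b, labels):
    """Among instances where both models err, return the fraction
    where they produce the same (wrong) answer."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    both_err = (preds_a != labels) & (preds_b != labels)
    if not both_err.any():
        return float("nan")  # undefined if the models never err together
    return float((preds_a[both_err] == preds_b[both_err]).mean())

# Toy example: both models err on the first three items,
# and give the same wrong answer on two of those three.
rate = error_agreement([1, 1, 2, 0], [1, 2, 2, 0], [0, 0, 0, 0])
print(rate)  # 2/3
```

Under a baseline where wrong answers are drawn independently, this rate would track chance agreement over the incorrect options; values far above that baseline are what indicate correlated errors.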
Problem

Research questions and friction points this paper is trying to address.

Investigates error correlation among diverse large language models
Examines factors like shared architectures causing model error agreement
Assesses downstream impacts on LLM-as-judge and hiring tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation of over 350 LLMs
Identified error correlation factors
Analyzed downstream task impacts
Elliot Kim
Cornell University
Avi Garg
Independent
Kenny Peng
PhD Student in Computer Science, Cornell University
Machine Learning · Algorithmic Fairness
Nikhil Garg
Cornell University