🤖 AI Summary
This study investigates cross-model correlations in error patterns of large language models (LLMs) and their implications for downstream tasks. Method: Through large-scale empirical analysis of 350+ models on two mainstream leaderboards and a realistic resume-screening task, we quantify multi-dimensional error correlations, measure cross-model error consistency, conduct systematic leaderboard evaluation, and run scenario-based experiments framed by the theory of "algorithmic monoculture." Contribution/Results: We discover, for the first time, substantial error consistency (up to 60%) among high-accuracy LLMs. While shared architectures and providers drive part of this correlation, larger and more accurate models err alike even across distinct architectures and vendors, challenging the common assumption that model diversity inherently mitigates homogeneity risks. This monoculture amplifies biases in LLM-as-judge evaluations and exacerbates hiring discrimination. Our findings provide critical empirical evidence for robustness assessment and responsible deployment of LLMs, urging a reevaluation of diversity-based risk-mitigation strategies in practice.
📝 Abstract
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation of over 350 LLMs, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even across distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
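
To make the headline statistic concrete: "models agree 60% of the time when both models err" is a conditional agreement rate over jointly misclassified examples. The paper's exact metric definition may differ; below is a minimal sketch of one plausible reading, with hypothetical function and variable names (`error_consistency`, `preds_a`, `preds_b`) chosen for illustration.

```python
import numpy as np

def error_consistency(preds_a, preds_b, labels):
    """Fraction of examples, among those where BOTH models are wrong,
    on which the two models give the same (wrong) answer.
    Returns NaN if the two models never err on the same example.
    NOTE: an assumed reading of the metric, not the paper's verified code.
    """
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    both_wrong = (preds_a != labels) & (preds_b != labels)
    if not both_wrong.any():
        return float("nan")
    return float((preds_a[both_wrong] == preds_b[both_wrong]).mean())

# Toy usage with multiple-choice answers: the models err on the same
# items and usually pick the same wrong option.
labels  = np.array(["A", "B", "C", "D", "A"])
model_1 = np.array(["A", "C", "C", "A", "B"])
model_2 = np.array(["A", "C", "C", "B", "B"])
print(error_consistency(model_1, model_2, labels))  # 2/3 ≈ 0.67
```

Under this reading, a value of 0.6 is far above what independent errors would produce on a multiple-choice benchmark (roughly chance over the wrong options), which is what makes the reported consistency evidence of homogeneity rather than coincidence.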