🤖 AI Summary
This study investigates cross-model correlations in error patterns of large language models (LLMs) and their implications for downstream tasks. Method: Through large-scale empirical analysis of 350+ models on two mainstream leaderboards and a realistic resume-screening task, we quantify multi-dimensional error correlations, measure cross-model error consistency, conduct systematic leaderboard evaluation, and run scenario-based experiments framed by the theory of "algorithmic monoculture." Contribution/Results: We discover, for the first time, substantial error consistency (up to 60%) among high-accuracy LLMs. While shared architectures and providers drive part of this correlation, larger and more accurate models err alike even across distinct architectures and vendors, challenging the common assumption that model diversity inherently mitigates homogeneity risks. This monoculture amplifies biases in LLM-as-judge evaluations and exacerbates hiring discrimination. Our findings provide critical empirical evidence for robustness assessment and responsible deployment of LLMs, urging a reevaluation of diversity-based risk-mitigation strategies in practice.
📝 Abstract
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation of over 350 LLMs, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even across distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
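
To make the headline statistic concrete: "models agree 60% of the time when both models err" is a conditional agreement rate over jointly misclassified examples. The paper's exact metric definition may differ; below is a minimal sketch of one plausible reading, with hypothetical function and variable names (`error_consistency`, `preds_a`, `preds_b`) chosen for illustration.

```python
import numpy as np

def error_consistency(preds_a, preds_b, labels):
    """Fraction of examples, among those where BOTH models are wrong,
    on which the two models give the same (wrong) answer.
    Returns NaN if the two models never err on the same example.
    NOTE: an assumed reading of the metric, not the paper's verified code.
    """
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    both_wrong = (preds_a != labels) & (preds_b != labels)
    if not both_wrong.any():
        return float("nan")
    return float((preds_a[both_wrong] == preds_b[both_wrong]).mean())

# Toy usage with multiple-choice answers: the models err on the same
# items and usually pick the same wrong option.
labels  = np.array(["A", "B", "C", "D", "A"])
model_1 = np.array(["A", "C", "C", "A", "B"])
model_2 = np.array(["A", "C", "C", "B", "B"])
print(error_consistency(model_1, model_2, labels))  # 2/3 ≈ 0.67
```

Under this reading, a value of 0.6 is far above what independent errors would produce on a multiple-choice benchmark (roughly chance over the wrong options), which is what makes the reported consistency evidence of homogeneity rather than coincidence.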