Empirically evaluating commonsense intelligence in large language models with large-scale human judgments

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a key limitation of conventional commonsense-intelligence evaluation: its reliance on static ground-truth labels that ignore heterogeneity in human judgment. The authors propose a large-scale, crowdsourcing-based paradigm that abandons predefined answers and instead quantifies the alignment between large language model (LLM) outputs and empirically collected human judgments. Crucially, human judgment heterogeneity is modeled explicitly as the basis for evaluation, revealing the cultural embeddedness of commonsense knowledge. The method introduces two complementary protocols, "virtual respondents" and a "population simulator," which enable correlation analysis between model and population judgments. Results show that most LLMs perform below the human median; simulated population-level consensus correlates only moderately with real human consensus; and several smaller, open-weight models prove more competitive than larger, proprietary frontier models on commonsense alignment. This work grounds commonsense assessment in empirical human diversity rather than an assumed uniformity.

📝 Abstract
Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
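The "virtual respondent" protocol described in the abstract can be illustrated with a minimal sketch. Assuming each respondent gives a binary agree/disagree judgment per statement, a model's individual commonsense competence can be scored as its agreement with the human majority vote and placed against the human median. The synthetic data, leave-one-out scoring, and function name below are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

# Hypothetical data: binary judgments (1 = "commonsensical", 0 = not).
# human_judgments: (n_humans, n_statements); model_judgments: (n_statements,)
rng = np.random.default_rng(0)
human_judgments = rng.integers(0, 2, size=(100, 50))
model_judgments = rng.integers(0, 2, size=50)

def alignment_score(judgments, reference):
    """Fraction of statements where `judgments` match the majority vote of `reference`."""
    majority = (reference.mean(axis=0) >= 0.5).astype(int)
    return (judgments == majority).mean()

# Score each human against the majority of the *other* humans (leave-one-out),
# then place the model's score within that distribution.
human_scores = []
for i in range(human_judgments.shape[0]):
    others = np.delete(human_judgments, i, axis=0)
    human_scores.append(alignment_score(human_judgments[i], others))

model_score = alignment_score(model_judgments, human_judgments)
print(f"Model score: {model_score:.2f}, human median: {np.median(human_scores):.2f}")
```

On real survey data, the finding reported above corresponds to the model score falling below the human median in this comparison.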
Problem

Research questions and friction points this paper is trying to address.

Evaluating commonsense intelligence in LLMs with human heterogeneity
Assessing LLM-human agreement on commonsense judgments
Comparing smaller open-weight models vs larger proprietary models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates human heterogeneity in commonsense evaluation
Uses LLMs as simulators of human populations
Compares model judgments with human population responses
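The "population simulator" idea can likewise be sketched under simple assumptions: given a per-statement consensus rate (the fraction of respondents judging a statement commonsensical) from a real survey and from LLM-generated virtual respondents, the two can be rank-correlated across statements. The synthetic data and the use of Spearman correlation here are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical consensus rates per statement: fraction of respondents who
# judged the statement to be commonsensical.
rng = np.random.default_rng(1)
n_statements = 50
human_consensus = rng.uniform(0, 1, size=n_statements)      # from a real survey
simulated_consensus = np.clip(                               # from LLM "virtual respondents"
    human_consensus + rng.normal(0, 0.3, size=n_statements), 0, 1
)

rho, p = spearmanr(human_consensus, simulated_consensus)
print(f"Spearman correlation between simulated and real consensus: rho={rho:.2f} (p={p:.3g})")
```

A modest rho in this comparison would correspond to the paper's finding that simulated populations only partially reproduce how strongly real humans agree on the same statements.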
Tuan Dung Nguyen
University of Pennsylvania
Computational Social Science · AI For Science
Duncan J. Watts
Operations, Information and Decisions Department, Wharton School, University of Pennsylvania; Annenberg School for Communication, University of Pennsylvania
Mark E. Whiting
University of Pennsylvania
Computational Social Science · Metascience · Collective Intelligence · CSCW · Future of Work