SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models’ (LLMs’) capacity to reason about social concepts: the abstract ideas that shape norms, culture, and institutions. To bridge this gap, the authors propose the SCRuB framework, an expert-anchored assessment pipeline in three phases: prompts are constructed from established sources, responses are written by both LLMs and human experts, and the responses are compared using a five-dimensional critical thinking rubric. A Panel of Disciplinary Perspectives ensemble, validated against independent expert judges, is introduced so the pipeline can generalize. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. The authors conclude that frontier LLMs now outperform human experts on this task across all five rubric dimensions, and that the single-turn, exam-style format has reached its ceiling, a condition they term “evaluation saturation.”
📝 Abstract
While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.
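The abstract reports two different statistics (a model response ranked first in 80.8% of judgments, and model responses preferred overall 74.4% of the time) without defining how each is computed. The sketch below shows one hypothetical reading under which the two numbers can differ: each judgment is assumed to be a ranking of several responses labeled by source, the "first" rate counts judgments whose top-ranked response is a model, and the "overall" rate counts model-versus-expert pairs within each ranking. The `Judgment` class, the `"model"`/`"expert"` labels, and both function names are illustrative assumptions, not the paper's data format or method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    # Hypothetical format: source labels ordered best to worst,
    # e.g. ("model", "expert", "model"). Not the paper's actual schema.
    ranking: tuple[str, ...]


def model_first_rate(judgments: list[Judgment]) -> float:
    """Fraction of judgments whose top-ranked response came from a model."""
    if not judgments:
        raise ValueError("no judgments supplied")
    firsts = sum(1 for j in judgments if j.ranking[0] == "model")
    return firsts / len(judgments)


def model_preference_rate(judgments: list[Judgment]) -> float:
    """Fraction of model-vs-expert pairs in which the model is ranked higher."""
    wins = 0
    total = 0
    for j in judgments:
        for i, higher in enumerate(j.ranking):
            for lower in j.ranking[i + 1:]:
                if higher == lower:
                    continue  # skip model-vs-model and expert-vs-expert pairs
                total += 1
                if higher == "model":
                    wins += 1
    if total == 0:
        raise ValueError("no model-vs-expert pairs found")
    return wins / total


# Toy data, invented purely for illustration.
toy = [
    Judgment(("model", "expert", "model")),
    Judgment(("expert", "model", "model")),
    Judgment(("model", "model", "expert")),
]
first_rate = model_first_rate(toy)
preference_rate = model_preference_rate(toy)
```

On this toy data the two rates diverge, which illustrates why a "ranked first" figure and an "overall preference" figure need not match; the paper itself should be consulted for the actual definitions.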
Problem

Research questions and friction points this paper is trying to address.

social concept reasoning
rubric-based evaluation
large language models
critical thinking
evaluation methodology
Innovation

Methods, ideas, or system contributions that make the work stand out.

social concept reasoning
rubric-based evaluation
expert-grounded assessment
evaluation saturation
disciplinary perspectives ensemble