NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of social intelligence in large language models lack a unified theoretical framework and therefore cannot support fine-grained diagnosis. To address this gap, this work draws on psychological and social theories to construct a social intelligence framework comprising four categories and eleven dimensions, developed through a literature review and multi-stage expert validation guided by psychometric principles. Building on this framework, the authors propose NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items set in representative Chinese contexts, which they present as the first holistic diagnostic evaluation grounded in social theory. Experiments with five frontier large language models and a human reference group show that the models score higher in aggregate accuracy yet share a consistent weakness in Communication, which NICE localizes to three capability facets: multi-turn communication, nonverbal communication, and synchrony.

📝 Abstract

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework for organizing social abilities, and therefore cannot support fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, the models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation as theory-grounded diagnosis of socially consequential weaknesses in LLMs.
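
The contrast between aggregate accuracy and facet-level diagnosis can be illustrated with a minimal Python sketch. It is not the authors' scoring code: the record layout, the function names (`accuracy`, `facet_gaps`, `weak_facets`), the `margin` threshold, the placeholder labels "Dimension A" and "Dimension B", and every data value below are assumptions invented for illustration. Only the Communication facet names come from the abstract.

```python
from collections import defaultdict

# Toy, made-up records (NOT results from the paper):
# (dimension, facet, model_correct, human_correct)
RECORDS = [
    ("Communication", "multi-turn communication", False, True),
    ("Communication", "multi-turn communication", False, True),
    ("Communication", "synchrony", False, True),
    ("Communication", "synchrony", True, True),
    ("Dimension A", "facet a1", True, False),  # placeholder labels
    ("Dimension A", "facet a1", True, False),
    ("Dimension A", "facet a2", True, False),
    ("Dimension A", "facet a2", True, True),
    ("Dimension B", "facet b1", True, False),
    ("Dimension B", "facet b1", True, True),
]


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def facet_gaps(records):
    """Map (dimension, facet) to model accuracy minus human accuracy."""
    by_facet = defaultdict(list)
    for dimension, facet, model_ok, human_ok in records:
        by_facet[(dimension, facet)].append((model_ok, human_ok))
    return {
        key: accuracy(m for m, _ in pairs) - accuracy(h for _, h in pairs)
        for key, pairs in by_facet.items()
    }


def weak_facets(records, margin=0.25):
    """Facets where the model trails the human group by at least `margin`."""
    return sorted(k for k, gap in facet_gaps(records).items() if gap <= -margin)


aggregate_model = accuracy(r[2] for r in RECORDS)
aggregate_human = accuracy(r[3] for r in RECORDS)
print(f"aggregate: model={aggregate_model:.2f} human={aggregate_human:.2f}")
for dimension, facet in weak_facets(RECORDS):
    print(f"weak facet: {dimension} / {facet}")
```

In this toy data the model has the higher aggregate accuracy, yet grouping by facet still surfaces the two Communication facets where it trails the human group. That is the kind of localization a single aggregate score would hide.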

Problem

Research questions and friction points this paper is trying to address.

social intelligence

large language models

diagnostic benchmark

human-AI interaction

evaluation framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

social intelligence

diagnostic benchmark

theory-grounded evaluation