LiveCultureBench: A Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language models predominantly emphasize task performance while overlooking cultural appropriateness and assessment reliability. This work proposes the first benchmark framework to integrate multicultural contexts, dynamic social interactions, and a dual-dimensional evaluation covering both task completion and norm adherence. The framework constructs a virtual town as a location-graph-based multi-agent simulation, embeds language models as resident agents, and introduces an LLM-driven norm adjudicator together with a mechanism for quantifying evaluator uncertainty. Experimental results reveal cross-cultural robustness disparities among models and trade-offs between task success and norm compliance, and they delineate the effective boundaries of automated evaluation while underscoring the need for human oversight.
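
The summary describes a virtual town built as a location graph and populated by LLM resident agents with cultural profiles. The sketch below illustrates one way such a setup could look in Python; the `Resident` schema, `build_town`, `step`, the place names, and the `llm` callable are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
import networkx as nx
from dataclasses import dataclass

# Hypothetical resident profile; field names are illustrative, not from the paper.
@dataclass
class Resident:
    name: str
    culture: str                    # e.g. a cultural/demographic profile label
    age: int
    occupation: str
    location: str                   # current node in the town graph
    daily_goal: str | None = None   # only the focal agent of an episode receives a goal

def build_town() -> nx.Graph:
    """Town as an undirected location graph; nodes are places, edges are walkable routes."""
    g = nx.Graph()
    g.add_edges_from([
        ("home_1", "main_street"),
        ("main_street", "cafe"),
        ("main_street", "market"),
        ("market", "temple"),
    ])
    return g

def step(agent: Resident, town: nx.Graph, llm) -> dict:
    """One simulation tick: the LLM picks an action given the agent's profile and neighborhood."""
    neighbors = list(town.neighbors(agent.location))
    prompt = (
        f"You are {agent.name}, a {agent.age}-year-old {agent.occupation} "
        f"with a {agent.culture} cultural background, currently at {agent.location}. "
        f"Adjacent locations: {neighbors}. Daily goal: {agent.daily_goal or 'none'}. "
        "Choose your next action (move, speak, or act) and explain briefly."
    )
    return {"agent": agent.name, "action": llm(prompt)}
```

In this sketch, the other residents would be stepped the same way to provide social context, while the norm adjudicator (not shown) would judge each action's cultural appropriateness.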

📝 Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms. The simulation models a small city as a location graph with synthetic residents having diverse demographic and cultural profiles. Each episode assigns one resident a daily goal while others provide social context. An LLM-based verifier generates structured judgments on norm violations and task progress, which we aggregate into metrics capturing task-norm trade-offs and verifier uncertainty. Using LiveCultureBench across models and cultural profiles, we study (i) cross-cultural robustness of LLM agents, (ii) how they balance effectiveness against norm sensitivity, and (iii) when LLM-as-a-judge evaluation is reliable for automated benchmarking versus when human oversight is needed.
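
The abstract mentions aggregating the verifier's structured judgments into metrics that capture the task-norm trade-off and verifier uncertainty. Below is a minimal Python sketch of one plausible aggregation; the `Judgment` schema, the harmonic-mean trade-off summary, and the confidence-spread uncertainty proxy are assumptions made for illustration, not the paper's actual formulas.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical structured judgment emitted by the LLM verifier; the schema is illustrative.
@dataclass
class Judgment:
    task_progress: float   # 0.0 (no progress) to 1.0 (daily goal completed)
    norm_violation: bool   # did this step breach a socio-cultural norm?
    confidence: float      # verifier's self-reported confidence in [0, 1]

def aggregate(judgments: list[Judgment]) -> dict:
    """Turn per-step verifier judgments into episode-level metrics.

    Task success is the final task-progress score, norm compliance is the
    fraction of steps without a violation, and verifier uncertainty is proxied
    here by the spread of the verifier's confidences across steps.
    """
    if not judgments:
        return {"task_success": 0.0, "norm_compliance": 1.0,
                "verifier_uncertainty": 0.0, "task_norm_harmonic_mean": 0.0}
    task_success = judgments[-1].task_progress
    norm_compliance = mean(0.0 if j.norm_violation else 1.0 for j in judgments)
    uncertainty = pstdev(j.confidence for j in judgments) if len(judgments) > 1 else 0.0
    denom = task_success + norm_compliance
    return {
        "task_success": task_success,
        "norm_compliance": norm_compliance,
        "verifier_uncertainty": uncertainty,
        # Harmonic mean penalizes agents that score well on only one of the two axes.
        "task_norm_harmonic_mean": 2 * task_success * norm_compliance / denom if denom else 0.0,
    }
```

A high uncertainty value would flag episodes where the automated judge is unreliable and human oversight is warranted, matching the abstract's third research question.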
Problem

Research questions and friction points this paper is trying to address.

cultural appropriateness
LLM evaluation
social norms
multi-cultural benchmark
evaluator reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent simulation
cultural appropriateness
dynamic social benchmark
LLM-as-a-judge
socio-cultural norms