🤖 AI Summary
To address the limited cross-cultural adaptability of large language models (LLMs), this paper proposes a multi-LLM agent debate framework. Two collaborating agents either engage in culturally grounded deliberation or dynamically switch between debate and introspective self-reflection, improving fairness and cultural sensitivity in social norm judgment. The framework contributes a dynamic debate–introspection scheduling mechanism and is evaluated on NormAd-ETI, a benchmark covering etiquette norms from 75 countries, across seven open-weight LLMs and 21 model combinations. Results show consistent gains in both overall accuracy and inter-cultural group parity over single-LLM baselines. Notably, 7–9B parameter models augmented with debate reach accuracy comparable to a standalone 27B model, offering better inference efficiency at comparable quality.
📝 Abstract
Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts in order to serve communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework in which two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where the agents exclusively debate, and another where they dynamically choose between self-reflection and debate at each turn. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7–9B parameters) to achieve accuracies comparable to that of a much larger model (27B parameters).
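The two variants described above can be sketched as a simple turn-taking loop. This is a minimal illustrative sketch, not the paper's implementation: `respond` stands in for an actual LLM call, and the per-turn mode choice (`choose_mode`) is a trivial placeholder for whatever policy the agents actually use to pick between debate and self-reflection.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of the two-agent debate loop. All names are illustrative.

@dataclass
class Agent:
    name: str
    # (scenario, visible transcript) -> utterance; a stand-in for an LLM call.
    respond: Callable[[str, List[str]], str]

    def choose_mode(self, transcript: List[str]) -> str:
        # In the dynamic variant each agent picks "debate" or "reflect" per
        # turn; here a trivial stand-in alternates on transcript length.
        return "debate" if len(transcript) % 2 == 0 else "reflect"

def debate(scenario: str, a: Agent, b: Agent,
           rounds: int = 2, dynamic: bool = False) -> List[str]:
    """Run the debate; in the dynamic variant agents may self-reflect instead."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in (a, b):
            mode = agent.choose_mode(transcript) if dynamic else "debate"
            if mode == "debate":
                # Debate: the agent sees the full shared transcript.
                msg = agent.respond(scenario, transcript)
            else:
                # Self-reflection: the agent revisits only its own last turn.
                own = [m for m in transcript if m.startswith(agent.name)]
                msg = agent.respond(scenario, own[-1:])
            transcript.append(f"{agent.name}: {msg}")
    return transcript
```

In the debate-only variant each agent conditions on the full shared transcript; in the dynamic variant a reflection turn restricts the context to the agent's own prior utterance, which is one plausible way to realize the "introspective reasoning" the summary describes. A final joint decision would then be extracted from the last round of the transcript.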