🤖 AI Summary
Large language models (xLLMs) exhibit structural understanding bottlenecks in multilingual, multiparty dialogue, yet existing benchmarks lack realistic complexity and multilingual parallelism. Method: We introduce XMP—a high-quality, multilingual parallel dialogue benchmark derived from authentic multiparty podcasts—featuring ≥3 participants per sample, covering sociocultural and political topics, with fine-grained dialogue structure annotations and cross-lingual consistency evaluation. Contribution/Results: Our empirical analysis reveals critical deficiencies: xLLMs achieve only 52% role-tracking accuracy and suffer a 37% drop in response coherence across languages, challenging the prevailing “multilingual complementarity” hypothesis. We propose a novel paradigm for modeling complex dialogue grounded in real-world podcast data, supported by controlled generation experiments and mechanistic analysis. The XMP dataset and evaluation framework are publicly released to advance standardized assessment of multilingual multiparty dialogue understanding.
📝 Abstract
Multilingual research has garnered increasing attention, especially in the domain of dialogue systems. The rapid advancements in large language models (LLMs) have fueled the demand for high-performing multilingual models. However, two major challenges persist: the scarcity of high-quality multilingual datasets and the limited complexity of existing datasets in capturing realistic dialogue scenarios. To address these gaps, we introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and entertainment.Through extensive experiments, we uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios. For instance, the widely accepted multilingual complementary ability of LLMs is notably impacted. By conducting further experiments, we explore the mechanisms of LLMs in multilingual environments from multiple perspectives, shedding new light on their performance in real-world, diverse conversational contexts.