PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

📅 2026-01-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks rarely capture how multilingual speakers code-switch in multi-turn, multi-party conversations. This work introduces PingPong, a human-authored dataset of code-switching dialogues among 2 to 4 participants, spanning five language combinations (some of them trilingual) and supporting three tasks: question answering, dialogue summarization, and topic classification. The authors report that the dialogues are significantly more natural and structurally diverse than machine-generated alternatives, with greater variation in message length, speaker dominance, and reply distance. Evaluations of several state-of-the-art language models show limited performance on these code-switched inputs, underscoring the need for more robust multilingual dialogue understanding.

📝 Abstract

Code-switching is a widespread practice among the world's multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of which are trilingual. Our dataset consists of human-authored conversations among 2 to 4 participants covering authentic, multi-threaded structures where replies frequently reference much earlier points in the dialogue. We demonstrate that our data is significantly more natural and structurally diverse than machine-generated alternatives, offering greater variation in message length, speaker dominance, and reply distance. Based on these dialogues, we define three downstream tasks: Question Answering, Dialogue Summarization, and Topic Classification. Evaluations of several state-of-the-art language models on PingPong reveal that performance remains limited on code-switched inputs, underscoring the urgent need for more robust NLP systems capable of addressing the intricacies of real-world multilingual discourse.
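
The three structural properties named in the abstract (message length, speaker dominance, reply distance) can be illustrated with a small sketch. This is not the authors' measurement code: the `Message` record, the `reply_to` field, the whitespace tokenization, and the particular definitions of each statistic are all assumptions made for illustration, and the toy messages are invented rather than taken from PingPong.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class Message:
    """Hypothetical record for one turn; not the dataset's actual schema."""
    speaker: str
    text: str
    reply_to: Optional[int] = None  # index of the message being answered, if any


def structural_stats(dialogue: list[Message]) -> dict[str, float]:
    """Compute simple structural statistics for one dialogue.

    Assumed definitions (the paper may define these differently):
    - message length: whitespace-separated token count
    - speaker dominance: share of messages sent by the most active speaker
    - reply distance: number of turns between a message and the one it answers
    """
    lengths = [len(m.text.split()) for m in dialogue]
    counts = Counter(m.speaker for m in dialogue)
    distances = [
        i - m.reply_to for i, m in enumerate(dialogue) if m.reply_to is not None
    ]
    return {
        "length_mean": mean(lengths),
        "length_std": pstdev(lengths),
        "dominance": max(counts.values()) / len(dialogue),
        "reply_distance_mean": mean(distances) if distances else 0.0,
        "reply_distance_max": float(max(distances)) if distances else 0.0,
    }


# Invented toy dialogue with three speakers and one delayed reply.
dialogue = [
    Message("A", "Kal movie dekhne chalein?"),
    Message("B", "Sure, which one?", reply_to=0),
    Message("C", "I am busy tomorrow, sorry", reply_to=0),
    Message("A", "The new one, 7 baje", reply_to=1),
]

stats = structural_stats(dialogue)
print(stats)
```

Comparing the spread of these values across human-authored and machine-generated dialogues is one plausible way to read the abstract's claim of greater structural diversity; a reply distance above 1 marks a message that answers something other than the immediately preceding turn.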

Problem

Research questions and friction points this paper is trying to address.

code-switching

multi-turn dialogue

multilingual

natural conversation

dialogue benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

code-switching

multi-turn dialogue

natural benchmark