🤖 AI Summary
Existing LLM benchmarks predominantly evaluate extraction of explicit information from single sentences and lack multilingual assessment of implicit semantic and pragmatic inference—such as salience recognition, entity tracking, discourse relations, and bridging inference—across sentences, paragraphs, multi-speaker dialogues, and cross-document contexts.
Method: We introduce DiscoTrack, a multilingual discourse-tracking benchmark covering 12 languages and four levels of discourse understanding—salience recognition, entity tracking, discourse relations, and bridging inference—with tasks that require integrating information across sentences, paragraphs, and multiple speaker utterances.
Contribution/Results: Evaluation shows that these tasks remain challenging even for state-of-the-art models, exposing limitations in current multilingual LLMs’ handling of implicit information and long-range contextual integration. DiscoTrack provides a rigorous, multilingual benchmark for discourse understanding beyond explicit, sentence-level extraction.
📝 Abstract
Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.