🤖 AI Summary
Existing LLM benchmarks predominantly evaluate extraction of explicit information from single sentences and lack multilingual assessment of implicit semantic and pragmatic inference—such as salience recognition, entity tracking, discourse relations, and bridging inference—across sentences, paragraphs, multi-speaker dialogues, and cross-document contexts.
Method: We introduce DiscoTrack, a multilingual discourse-tracking benchmark covering 12 languages and four levels of discourse understanding—salience recognition, entity tracking, discourse relations, and bridging inference—with tasks that require integrating information across sentences, paragraphs, and multiple speaker utterances.
Contribution/Results: Evaluation shows that these tasks remain challenging even for state-of-the-art models, exposing limitations in current multilingual LLMs’ handling of implicit information and long-range contextual integration. DiscoTrack provides a rigorous, multilingual benchmark for discourse understanding beyond explicit, sentence-level extraction.
📝 Abstract
Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.