BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

📅 2025-11-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) perform inconsistently across levels of discourse understanding (lexical, sentence-level, and document-level), particularly on challenging tasks such as temporal reasoning, rhetorical relation identification, and discourse particle disambiguation. Method: We introduce BeDiscovER, a multi-level, multilingual, multi-framework discourse understanding benchmark tailored for reasoning-oriented LLMs. It integrates 52 datasets to cover fine-grained discourse phenomena and cross-level tasks, including novel challenges such as discourse particle disambiguation. Evaluation uses a unified framework and multidimensional protocols across state-of-the-art models, including Qwen3, DeepSeek-R1, and GPT-5-mini. Contribution/Results: The models perform strongly on the arithmetic aspects of temporal reasoning but show persistent bottlenecks in full-document reasoning and rhetorical relation recognition, which points to gaps in holistic semantic comprehension and cross-sentential coherence modeling.

📝 Abstract

We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across the discourse lexicon, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, as well as novel challenges such as discourse particle disambiguation (e.g., "just"), and it also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER. We find that state-of-the-art models perform strongly on the arithmetic aspect of temporal reasoning, but they struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

Problem

Research questions and friction points this paper is trying to address.

Evaluating discourse-level knowledge of modern reasoning language models

Assessing performance across multi-level discourse tasks and datasets

Identifying limitations in document reasoning and semantic phenomena understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark suite for discourse-level knowledge evaluation

Compiles 52 datasets across multiple discourse levels

Evaluates LLMs on temporal reasoning and semantic challenges
