HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Hebrew NLP benchmarks emphasize morpho-syntactic tasks and lack semantically focused machine reading comprehension (MRC) resources. Method: The authors introduce HeQ, a large-scale extractive MRC benchmark for Hebrew comprising 30,147 question-answer pairs drawn from Hebrew Wikipedia articles and Israeli tech news. To address challenges arising from Hebrew's rich morphology (ambiguous answer-span boundaries, annotation inconsistency, and the unreliability of standard F1/Exact Match (EM) scores), they devise morphology-aware annotation guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics. Contribution/Results: Empirical analysis shows that standard F1 and EM are unreliable for Hebrew, and that performance on morpho-syntactic tasks correlates only weakly with MRC performance. HeQ fills a gap in evaluating semantic understanding in Hebrew and offers a transferable methodology, including revised evaluation metrics, for building MRC benchmarks for other morphologically rich languages.

📝 Abstract
Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. In addition, our experiments show low correlation between models' performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.
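To see why standard SQuAD-style metrics struggle with Hebrew, consider that prepositions and articles attach to the following word as prefixes, so a predicted answer can differ from the gold answer by a single fused clitic yet share no whitespace token with it. The sketch below implements the conventional token-level F1 (not the paper's revised metric) and shows how it zeroes out such a pair; the example strings are illustrative, not taken from HeQ.

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Conventional SQuAD-style token-level F1 over whitespace tokens."""
    pred_toks, gold_toks = pred.split(), gold.split()
    # Multiset intersection counts tokens shared by prediction and gold.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# In Hebrew the preposition -ב ("in") fuses onto the noun, so these two
# answers are a single token each and never match at the token level.
gold = "ירושלים"    # "Jerusalem"
pred = "בירושלים"   # "in Jerusalem", one fused word
print(token_f1(pred, gold))  # 0.0 despite near-identical answers
```

A character- or morpheme-aware comparison would credit the shared stem here, which is the kind of adjustment the paper's revised metrics are designed to make.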
Problem

Research questions and friction points this paper is trying to address.

Lack of semantic-focused Hebrew NLP benchmarks
Challenges in Hebrew MRC due to morphological complexity
Inadequacy of standard metrics for Hebrew MRC evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel guidelines for Hebrew MRC annotation
Controlled crowdsourcing protocol for QA pairs
Enhanced evaluation metrics for morphologically rich languages