EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

📅 2026-04-23

🤖 AI Summary
Existing event extraction datasets tend to cover few event types, are restricted to closed domains, or lack large-scale human validation. To address these limitations, this work introduces EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset built with a systematic annotation pipeline. The study evaluates state-of-the-art pre-trained large language models on the dataset to establish a benchmark. Models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which suggests the dataset is a useful resource for developing generalizable open-domain event extraction algorithms. The paper also summarizes lessons learned and offers recommendations for future large-scale dataset development.

📝 Abstract
Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during dataset development and provide recommendations to support future large-scale dataset development.
Problem

Research questions and friction points this paper is trying to address.

event extraction
open-domain
dataset limitations
manual annotation
large-scale dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-domain event extraction
large-scale annotated dataset
systematic annotation pipeline
model generalization
benchmark evaluation