🤖 AI Summary
Event extraction faces a fundamental trade-off: discriminative models achieve high precision but low recall, while generative models suffer from severe hallucination. To address this, we propose ARIS, a framework that tightly integrates a Self Mixture of Agents with a discriminative sequence tagger. ARIS mitigates hallucination through model-consensus reasoning, confidence-based filtering, and LLM-driven reflective refinement. It further introduces decomposition-based instruction tuning to explicitly encode event structure, enhancing the LLM's capacity for structured understanding. Evaluated on three event extraction benchmarks, ARIS outperforms state-of-the-art methods, improving both precision and recall for trigger identification and argument extraction. The framework delivers high-coverage, end-to-end event extraction without compromising fidelity or structural integrity.
📝 Abstract
Event Extraction (EE) involves automatically identifying and extracting structured information about events from unstructured text, including triggers, event types, and arguments. Traditional discriminative models demonstrate high precision but often exhibit limited recall, particularly for nuanced or infrequent events. Conversely, generative approaches leveraging Large Language Models (LLMs) provide greater semantic flexibility and higher recall but suffer from hallucinations and inconsistent predictions. To address these challenges, we propose the Agreement-based Reflective Inference System (ARIS), a hybrid approach combining a Self Mixture of Agents with a discriminative sequence tagger. ARIS explicitly leverages structured model consensus, confidence-based filtering, and an LLM reflective inference module to reliably resolve ambiguities and enhance overall event prediction quality. We further investigate decomposed instruction fine-tuning to strengthen the LLM's understanding of event structure. Experiments demonstrate that our approach outperforms existing state-of-the-art event extraction methods across three benchmark datasets.
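The consensus-and-filtering idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Trigger` dataclass, `resolve_events` function, and the threshold `tau` are all assumed names, and the real system would additionally route the ambiguous set through the LLM reflective inference module.

```python
# Hedged sketch of agreement-based trigger resolution.
# All names (Trigger, resolve_events, tau) are illustrative assumptions,
# not the paper's actual interfaces.
from dataclasses import dataclass

@dataclass(frozen=True)
class Trigger:
    span: tuple          # (start, end) character offsets in the source text
    event_type: str      # e.g. "Attack", "Meet"

def resolve_events(tagger_preds, llm_preds, tagger_conf, tau=0.9):
    """Combine discriminative and generative trigger predictions.

    tagger_preds / llm_preds: sets of Trigger from each model.
    tagger_conf: dict mapping Trigger -> tagger confidence in [0, 1].
    Returns (accepted, ambiguous): triggers backed by model consensus or
    high tagger confidence are accepted; the rest are deferred for
    reflective LLM refinement.
    """
    accepted, ambiguous = set(), set()
    for t in tagger_preds | llm_preds:
        if t in tagger_preds and t in llm_preds:
            accepted.add(t)                      # model consensus
        elif tagger_conf.get(t, 0.0) >= tau:
            accepted.add(t)                      # confidence-based filtering
        else:
            ambiguous.add(t)                     # defer to reflective inference
    return accepted, ambiguous
```

A disagreement between the two models is thus never silently dropped or blindly trusted: it is either rescued by high discriminative confidence or escalated to the reflective module for resolution.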