ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

2026-03-04
AI Summary
This work addresses cascading errors in multimedia event extraction, where early cross-modal misalignments corrupt downstream role assignment, by proposing the ECHO framework. ECHO coordinates multiple specialized agents that apply atomic operations to a shared Multimedia Event Hypergraph (MEHG), iteratively refining event structures. Its key contributions are an event-centric hypergraph that serves as an explicit intermediate representation, a multi-agent collaboration mechanism grounded in hypergraph operations, and a Link-then-Bind strategy that decouples argument identification from role binding to limit error propagation. On the M2E2 benchmark with Qwen3-32B, ECHO improves average event mention F1 by 7.3% and argument role F1 by 15.5% over the state of the art.

Abstract
Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by applying atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state of the art (SOTA): with Qwen3-32B, it achieves 7.3% and 15.5% improvements in average event mention F1 and argument role F1, respectively.
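The abstract does not spell out the operation set, so the following is only a minimal sketch of how a shared event hypergraph with atomic operations and a Link-then-Bind ordering might look. The class names, the `link`/`bind`/`unlink` operations, the node identifiers, and the event type and role labels are all hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class EventHyperedge:
    """One event hypothesis: a trigger plus the argument nodes it connects."""
    event_type: str
    trigger: str
    linked: set = field(default_factory=set)   # arguments identified, role undecided
    roles: dict = field(default_factory=dict)  # argument -> role, filled in by bind


class EventHypergraph:
    """Toy stand-in for a shared event hypergraph (hypothetical API)."""

    def __init__(self):
        self.nodes = {}  # node id -> modality ("text" or "image")
        self.edges = {}  # event id -> EventHyperedge

    def add_node(self, node_id, modality):
        self.nodes[node_id] = modality

    def add_event(self, event_id, event_type, trigger):
        self.edges[event_id] = EventHyperedge(event_type, trigger)

    def link(self, event_id, node_id):
        """Stage 1: attach a candidate argument without committing to a role."""
        if node_id not in self.nodes:
            raise KeyError(f"unknown node: {node_id}")
        self.edges[event_id].linked.add(node_id)

    def bind(self, event_id, node_id, role):
        """Stage 2: assign a role, allowed only for already-linked arguments."""
        edge = self.edges[event_id]
        if node_id not in edge.linked:
            raise ValueError(f"{node_id} must be linked before binding")
        edge.roles[node_id] = role

    def unlink(self, event_id, node_id):
        """Retract a candidate argument and any role it was given."""
        edge = self.edges[event_id]
        edge.linked.discard(node_id)
        edge.roles.pop(node_id, None)


# Illustrative run: two "agents" act on the same graph in sequence.
g = EventHypergraph()
g.add_node("protesters", "text")   # a textual entity mention
g.add_node("bbox_1", "image")      # a visual region
g.add_event("e1", "Demonstrate", "rally")
g.link("e1", "protesters")         # linking agent proposes candidates
g.link("e1", "bbox_1")
g.bind("e1", "protesters", "Entity")  # binding agent commits to a role
g.unlink("e1", "bbox_1")           # a later pass retracts a weak candidate
```

In this sketch, retracting `bbox_1` touches only one hyperedge and leaves the already-bound role intact, which is the kind of localized correction that deferred commitment is meant to allow.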
Problem

Research questions and friction points this paper is trying to address.

Multimedia Event Extraction
cascading errors
cross-modal misalignment
argument role assignment
event grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimedia Event Extraction
Hypergraph Operations
Multi-Agent Collaboration
Link-then-Bind
Error Propagation Mitigation
Hailong Chu
Beijing University of Posts and Telecommunications
Shuo Zhang
Beijing University of Posts and Telecommunications
Yunlong Chu
Tianjin University
Shutai Huang
Beijing University of Posts and Telecommunications
Xingyue Zhang
Beijing University of Posts and Telecommunications
Tinghe Yan
Chongqing University of Posts and Telecommunications
Jinsong Zhang
Université Laval
Computer Vision, Deep Learning, Computer Graphics
Lei Li
Beijing University of Posts and Telecommunications