RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimedia event extraction faces significant challenges, including scarce annotated data, difficult cross-modal semantic alignment, and insufficient learning of structured representations. This work proposes a relation-aware multi-task progressive learning framework that introduces, for the first time, a staged training strategy for integrating heterogeneous supervisory signals from unimodal event extraction and multimodal relation extraction. This allows the model to learn shared cross-modal event representations without requiring end-to-end annotations. The method combines vision-language models with a unified event schema, substantially improving both event mention identification and argument role extraction, with consistent and significant gains across multiple vision-language models on the M2E2 benchmark.

📝 Abstract
Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images, which requires grounding event semantics across modalities. Progress in MEE is limited by the lack of annotated training data: M2E2, the only established benchmark, provides annotations only for evaluation, making direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs); they do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction through stage-wise training: the model is first trained with a unified schema to learn shared event-centric representations across modalities, and is then fine-tuned for event mention identification and argument role extraction on mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.
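
As a rough illustration of the stage-wise training described in the abstract, the sketch below trains one shared backbone first on mixed heterogeneous supervision under a unified schema and then fine-tunes it on the MEE subtasks. The toy model, synthetic datasets, and hyperparameters are placeholder assumptions for illustration only, not the authors' RMPL implementation.

```python
# A minimal, runnable sketch of the two-stage idea; toy components throughout.
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

class ToyEventModel(nn.Module):
    """Stand-in for a VLM backbone with a shared event-schema classification head."""
    def __init__(self, dim=16, num_labels=4):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, features, labels):
        logits = self.head(torch.relu(self.backbone(features)))
        return nn.functional.cross_entropy(logits, labels)

def train_stage(model, dataset, lr, epochs):
    """One training stage; both stages reuse this loop on different supervision."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for features, labels in loader:
            loss = model(features, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Synthetic stand-ins for the heterogeneous supervision sources.
text_event_data = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
image_relation_data = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
mixed_mee_data = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))

model = ToyEventModel()

# Stage 1: learn shared event-centric representations under a unified schema by
# mixing unimodal event extraction and multimedia relation extraction signals.
train_stage(model, ConcatDataset([text_event_data, image_relation_data]), lr=1e-4, epochs=3)

# Stage 2: fine-tune the same backbone for event mention identification and
# argument role extraction on mixed textual and visual data.
train_stage(model, mixed_mee_data, lr=5e-5, epochs=3)
```

The point of the structure is that both stages share one backbone and one training loop; only the supervision source and learning rate change between stages.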
Problem

Research questions and friction points this paper is trying to address.

Multimedia Event Extraction
low-resource learning
cross-modal grounding
structured event representation
multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimedia Event Extraction
Progressive Learning
Relation-aware
Stage-wise Training
Low-resource Learning
Yongkang Jin
School of Computer Science and Technology, Soochow University, Suzhou, China
Jianwen Luo
School of Computer Science and Technology, Soochow University, Suzhou, China
Jingjing Wang
Jianmin Yao
School of Computer Science and Technology, Soochow University, Suzhou, China
Yu Hong