Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic evaluation of Large Vision-Language Models (LVLMs) on Multimedia Event Extraction (M2E2), covering text-only, image-only, and cross-modal subtasks. To address the lack of prior benchmarking and fine-grained analysis, we conduct multimodal joint modeling and detailed error analysis using DeepSeek-VL2 and the Qwen-VL series under both few-shot prompting and LoRA-based fine-tuning. Results reveal that LVLMs exhibit notable cross-modal synergy but suffer from key limitations: weak textual semantic parsing, imprecise event localization, and insufficient image-text alignment. LoRA fine-tuning significantly improves performance, confirming its efficacy for efficient LVLM adaptation. Our study establishes the first dedicated benchmark framework for LVLMs on M2E2, identifies critical modality-complementarity mechanisms, and pinpoints concrete optimization directions, thereby providing both methodological guidance and empirical foundations for multimodal event understanding.
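The few-shot setting mentioned above can be pictured as plain prompt construction around demonstration examples. The sketch below is illustrative only: the event types shown (e.g., Conflict.Attack, Movement.Transport) come from the ACE ontology that M2E2 reuses, but the template wording, the demonstrations, and the helper function are hypothetical and are not the paper's actual prompts.

```python
# Minimal sketch of few-shot prompting for textual event extraction.
# Demonstrations and prompt template are hypothetical, not the paper's prompts.

FEW_SHOT_DEMOS = [
    {
        "sentence": "Protesters clashed with police in the capital on Friday.",
        "output": "event_type: Conflict.Attack; trigger: clashed; "
                  "arguments: Attacker=Protesters, Target=police, Place=the capital",
    },
    {
        "sentence": "The aid convoy transported supplies across the border.",
        "output": "event_type: Movement.Transport; trigger: transported; "
                  "arguments: Artifact=supplies, Destination=across the border",
    },
]

def build_prompt(sentence: str) -> str:
    """Assemble a few-shot prompt: task instruction, demonstrations, then the query."""
    lines = [
        "Extract the event type, trigger word, and argument roles from the sentence.",
        "Answer 'none' if no event is present.",
        "",
    ]
    for demo in FEW_SHOT_DEMOS:
        lines.append(f"Sentence: {demo['sentence']}")
        lines.append(f"Answer: {demo['output']}")
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Answer:")
    return "\n".join(lines)

# Example usage (the LVLM call itself is model-specific and omitted here):
prompt = build_prompt("Troops were deployed to the region after the ceasefire collapsed.")
print(prompt)
```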

📝 Abstract
The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.
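For readers unfamiliar with the LoRA adaptation mentioned in the abstract, the following is a minimal sketch using Hugging Face transformers and peft. The model identifier, target modules, and hyperparameters are assumptions for illustration; the paper does not report its exact configuration here, and attention module names differ between LVLM architectures.

```python
# Illustrative LoRA setup for parameter-efficient fine-tuning of an LVLM.
# Model name, target modules, and hyperparameters are assumptions, not the
# paper's reported configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen-VL-Chat"  # assumed checkpoint; any LVLM with LoRA-compatible layers works
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; actual module names vary per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated during fine-tuning
```

The appeal of this recipe, as the paper's findings suggest, is that only the low-rank adapter weights are trained while the base LVLM stays frozen, making adaptation to M2E2 feasible on modest hardware.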
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs on multimedia event extraction tasks
Assessing performance across text, image, and cross-modal subtasks
Identifying challenges in semantic precision and cross-modal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of LVLMs on M2E2 tasks
Fine-tuning LVLMs with LoRA enhances performance
LVLMs show strong synergy in cross-modal settings
👥 Authors
Fuyu Xing
School of Advanced Technology, Xi’an Jiaotong-Liverpool University
Zimu Wang
Tsinghua University
Wei Wang
School of Advanced Technology, Xi’an Jiaotong-Liverpool University
Haiyang Zhang
Nanjing University of Posts and Telecommunications