🤖 AI Summary
This work presents the first systematic evaluation of Large Vision-Language Models (LVLMs) on Multimedia Event Extraction (M2E2), covering text-only, image-only, and cross-modal subtasks. To address the lack of prior benchmarking and fine-grained analysis, we perform multimodal joint modeling and detailed error analysis with DeepSeek-VL2 and Qwen-VL under both few-shot prompting and LoRA-based fine-tuning. Results show that LVLMs exhibit notable cross-modal synergy but suffer from key limitations: weak textual semantic parsing, imprecise event localization, and insufficient image-text alignment. LoRA fine-tuning significantly improves performance, confirming its efficacy for efficient LVLM adaptation. Our study establishes the first dedicated benchmark framework for LVLMs on M2E2, identifies critical modality-complementarity mechanisms, and pinpoints concrete optimization directions, providing both methodological guidance and an empirical foundation for multimodal event understanding.
📝 Abstract
The proliferation of multimedia content necessitates effective Multimedia Event Extraction (M2E2) systems. Although Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility for the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluation covers text-only, image-only, and cross-media subtasks under both few-shot prompting and fine-tuning settings. It yields three key findings: (1) few-shot LVLMs perform notably better on visual tasks than on textual tasks, where they struggle significantly; (2) fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis revealing persistent challenges in semantic precision, event localization, and cross-modal grounding, which remain critical obstacles to advancing M2E2 capabilities.
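The LoRA adaptation named above can be sketched in isolation: LoRA freezes a pretrained weight matrix and learns only a low-rank correction, which is why it enables parameter-efficient LVLM fine-tuning. The minimal NumPy sketch below illustrates the mechanism on a single linear layer; the layer sizes and rank are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of a LoRA update on one weight matrix (illustrative sizes).
# LoRA freezes the pretrained weight W and learns a low-rank correction
# B @ A, so the adapted forward pass is
#     y = x @ (W + (alpha / r) * B @ A).T
# Only A and B are trained, cutting the trainable parameter count sharply.

d_out, d_in, r, alpha = 512, 512, 8, 16  # hypothetical dimensions and rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def lora_forward(x):
    """Forward pass with the scaled low-rank correction applied."""
    delta = (alpha / r) * (B @ A)
    return x @ (W + delta).T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)

# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning begins from the frozen model's behavior.
full_params = W.size           # 262144 parameters in the frozen layer
lora_params = A.size + B.size  # 8192 trainable parameters (~3%)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly; training then moves only the 2·r·d low-rank parameters, which is the efficiency property the paper's fine-tuning setting relies on.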