AI Summary
Mutual Reinforcement Effect (MRE), previously studied only in text-based information extraction, has not been explored in visual or multimodal scenarios.
Method: This paper introduces Multimodal Mutual Reinforcement Extraction (M-MRE), the first framework extending MRE to multimodal information extraction. We construct the first dedicated M-MRE benchmark for joint image-text understanding and propose a Prompt Format Adapter (PFA) enabling plug-and-play integration with diverse Large Vision-Language Models (LVLMs). Our approach unifies text extraction, image understanding, and cross-modal reasoning via multimodal joint modeling and cross-granularity three-task collaborative learning.
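To make the core idea concrete, below is a minimal sketch of how a Prompt Format Adapter might pack the three interrelated tasks into a single LVLM prompt and parse the joint response. All task wording, field labels, and the `TASK<n>:` output convention here are illustrative assumptions for exposition; the paper's actual PFA format may differ.

```python
# Hedged sketch of a Prompt Format Adapter (PFA): one prompt carries a
# coarse-grained task, an image-understanding task, and a fine-grained
# extraction task, so a single LVLM pass can exploit cross-task
# reinforcement. Task phrasing and parsing convention are assumptions.

def build_mre_prompt(image_placeholder: str, text: str) -> str:
    """Combine three tasks of different granularities into one prompt."""
    tasks = [
        "1. Classify the overall sentiment of the image-text pair.",
        "2. Summarize the image content in one sentence.",
        "3. Extract all entity spans from the text with their types.",
    ]
    return (
        f"{image_placeholder}\n"
        f"Text: {text}\n"
        "Answer each task on its own line as 'TASK<n>: <answer>'.\n"
        + "\n".join(tasks)
    )

def parse_mre_output(raw: str) -> dict:
    """Split the model's joint response back into per-task answers."""
    answers = {}
    for line in raw.splitlines():
        if line.startswith("TASK") and ":" in line:
            key, _, value = line.partition(":")
            answers[key.strip()] = value.strip()
    return answers
```

Because the adapter only manipulates prompt text and parses plain-text output, it stays model-agnostic, which is what makes plug-and-play use with different LVLMs plausible.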
Contribution/Results: Extensive experiments demonstrate significant performance gains across multiple downstream tasks, validating MRE's effectiveness and generalizability in multimodal settings. M-MRE establishes a novel paradigm and a scalable technical pathway for multimodal information extraction, advancing beyond unimodal MRE.
Abstract
Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.