🤖 AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to modality bias in Grounded Multimodal Named Entity Recognition (GMNER), where models often rely on unimodal shortcuts rather than performing rigorous cross-modal reasoning. To mitigate this issue, the authors propose the Modality-aware Consistency Reasoning (MCR) framework, which combines Multi-style Reasoning Schema Injection (MRSI) with Constraint-guided Verifiable Optimization (CVO): MRSI instills structured cross-modal verification by converting abstract constraints into executable reasoning chains, while CVO aligns the model's reasoning trajectories via Group Relative Policy Optimization (GRPO). Experimental results demonstrate that the proposed approach significantly outperforms existing baselines on both GMNER and visual grounding tasks, effectively alleviating modality bias and strengthening the model's capacity for robust cross-modal reasoning.
📄 Abstract
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit $\textbf{modality bias}$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ($\textbf{MCR}$), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
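The abstract states that CVO aligns reasoning trajectories with Group Relative Policy Optimization (GRPO). A core ingredient of GRPO is computing a group-relative advantage: each sampled trajectory's reward is normalized against the mean and standard deviation of its sampling group, removing the need for a learned value baseline. The sketch below illustrates only that normalization step; the function name, the epsilon term, and the example reward values are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the group-relative advantage used in GRPO
# (Group Relative Policy Optimization). Names and values here are
# illustrative assumptions, not the authors' implementation.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled trajectory's reward against its group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for one group of sampled reasoning trajectories,
# e.g. from verifiable checks on format, entity labels, and grounding.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)  # advantages sum to ~0; best trajectory is positive
```

Trajectories scoring above the group mean receive positive advantages (and are reinforced), while below-average ones are penalized, which is how verifiable constraint rewards could steer the policy toward consistent cross-modal reasoning.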