ReFineG: Synergizing Small Supervised Models and LLMs for Low-Resource Grounded Multimodal NER

πŸ“… 2025-09-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address two key challenges in low-resource Grounded Multimodal Named Entity Recognition (GMNER), namely the heavy reliance of supervised models on costly manual annotations and the domain knowledge conflicts inherent in multimodal large language models (MLLMs), this paper proposes a three-stage collaborative framework. First, domain-aware synthetic data generation alleviates annotation scarcity. Second, an uncertainty-estimation-based task allocation mechanism dynamically divides textual mention detection between the small supervised model and the MLLM. Third, analogy-driven multimodal context selection mitigates domain-knowledge interference during visual grounding. The framework keeps the MLLM's parameters frozen and tightly couples a lightweight supervised model with an uncertainty estimation module. Evaluated on the CCKS2025 GMNER benchmark, it achieves an F1 score of 0.6461, ranking second overall, with notable gains in few-shot generalization and cross-modal alignment accuracy.


πŸ“ Abstract
Grounded Multimodal Named Entity Recognition (GMNER) extends traditional NER by jointly detecting textual mentions and grounding them to visual regions. While existing supervised methods achieve strong performance, they rely on costly multimodal annotations and often underperform in low-resource domains. Multimodal Large Language Models (MLLMs) show strong generalization but suffer from Domain Knowledge Conflict, producing redundant or incorrect mentions for domain-specific entities. To address these challenges, we propose ReFineG, a three-stage collaborative framework that integrates small supervised models with frozen MLLMs for low-resource GMNER. In the Training Stage, a domain-aware NER data synthesis strategy transfers LLM knowledge to small models with supervised training while avoiding domain knowledge conflicts. In the Refinement Stage, an uncertainty-based mechanism retains confident predictions from supervised models and delegates uncertain ones to the MLLM. In the Grounding Stage, a multimodal context selection algorithm enhances visual grounding through analogical reasoning. In the CCKS2025 GMNER Shared Task, ReFineG ranked second with an F1 score of 0.6461 on the online leaderboard, demonstrating its effectiveness with limited annotations.
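The Refinement Stage described above routes each prediction by confidence: keep what the supervised model is sure about, hand the rest to the frozen MLLM. A minimal sketch of that routing logic is below; the function names (`small_model_predict`, `mllm_predict`) and the threshold value are illustrative assumptions, not the paper's actual API or reported hyperparameters.

```python
# Hypothetical sketch of uncertainty-based task allocation (Refinement Stage).
# All names and the threshold are assumptions for illustration only.

THRESHOLD = 0.85  # assumed confidence cutoff for trusting the small model

def small_model_predict(text):
    # Stand-in for the supervised NER model: returns (mention, confidence) pairs.
    return [("Sun Yat-sen University", 0.95), ("ReFineG", 0.40)]

def mllm_predict(text, mention):
    # Stand-in for querying the frozen MLLM about an uncertain mention;
    # here it simply echoes the candidate back.
    return mention

def refine(text):
    """Keep confident supervised predictions; delegate uncertain ones to the MLLM."""
    final = []
    for mention, conf in small_model_predict(text):
        if conf >= THRESHOLD:
            final.append(mention)                      # confident: keep as-is
        else:
            final.append(mllm_predict(text, mention))  # uncertain: delegate
    return final
```

The key design point is that the MLLM is only consulted for low-confidence mentions, so its domain-knowledge conflicts cannot overwrite predictions the supervised model already handles well.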
Problem

Research questions and friction points this paper is trying to address.

Improving low-resource Grounded Multimodal NER performance
Addressing domain knowledge conflict in multimodal LLMs
Reducing reliance on costly multimodal annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage collaborative framework integration
Domain-aware NER data synthesis strategy
Uncertainty-based prediction refinement mechanism
πŸ”Ž Similar Papers
No similar papers found.
Jielong Tang
School of Artificial Intelligence, Sun Yat-sen University
NLP · KG · Multimodal
Shuang Wang
Beijing Normal University
Zhenxing Wang
Finisar Corporation
Fiber Optic Communications
Jianxing Yu
School of Artificial Intelligence, Sun Yat-sen University; Key Laboratory of Sustainable Tourism Smart Assessment Technology, Ministry of Culture and Tourism, Sun Yat-sen University
Jian Yin
School of Artificial Intelligence, Sun Yat-sen University; Key Laboratory of Sustainable Tourism Smart Assessment Technology, Ministry of Culture and Tourism, Sun Yat-sen University