🤖 AI Summary
To address key bottlenecks in multimodal RAG systems—including inaccurate user intent understanding, monolithic retrieval strategies, and weak filtering of inappropriate responses—this paper proposes an end-to-end, multi-stage framework. First, it introduces an image-context-enhanced intent refinement module to improve the semantic accuracy of queries. Second, it designs an intent-driven, context-aware query generation mechanism coupled with collaborative retrieval across heterogeneous APIs. Third, it pioneers a dynamic, organization-aware three-tier joint filtering mechanism—operating over images, text, and multimodal representations—to enable fine-grained relevance and safety control. The method integrates multimodal large language models (MLLMs), cross-modal classifiers, policy-aware dynamic filtering, and an API-integrated architecture. Extensive evaluation on multiple public benchmarks—including knowledge-intensive visual question answering (VQA) and safety-oriented datasets—as well as real-world data demonstrates consistent superiority over state-of-the-art methods, setting new records across several key metrics.
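The three-tier joint filtering described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the score functions, `Policy` structure, and thresholds are all assumed names standing in for the image-based, text-based, and multimodal classifiers and the organization-defined policies.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    # Per-category thresholds an organization might configure (assumed
    # structure; the paper's policies may be richer than three scalars).
    image_threshold: float
    text_threshold: float
    multimodal_threshold: float


def joint_filter(image_score: float, text_score: float,
                 multimodal_score: float, policy: Policy) -> bool:
    """Pass a candidate response only if it clears all three tiers."""
    return (image_score >= policy.image_threshold
            and text_score >= policy.text_threshold
            and multimodal_score >= policy.multimodal_threshold)


strict = Policy(image_threshold=0.8, text_threshold=0.7,
                multimodal_threshold=0.75)
print(joint_filter(0.9, 0.85, 0.80, strict))  # clears every tier
print(joint_filter(0.9, 0.50, 0.80, strict))  # rejected by the text tier
```

The point of the joint (conjunctive) design is that a response must satisfy every tier, so a policy can tighten any single modality's threshold independently without retraining the others.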
📝 Abstract
The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and public benchmarks on knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.
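The five pipeline stages the abstract names can be sketched as a chain of calls. Every function below is a placeholder stub (an assumption for illustration); in CUE-M each stage would be backed by an MLLM, external search APIs, or trained classifiers rather than these string manipulations.

```python
def enrich_image_context(image: str) -> str:
    # Stage 1: image context enrichment — derive a textual description
    # of the image (stubbed here).
    return f"context({image})"


def refine_intent(query: str, context: str) -> str:
    # Stage 2: intent refinement using the enriched image context.
    return f"intent({query}|{context})"


def generate_queries(intent: str) -> list[str]:
    # Stage 3: contextual query generation (stubbed: two variants).
    return [f"query-{i}:{intent}" for i in range(2)]


def retrieve(query: str) -> str:
    # Stage 4: external API integration (stubbed retrieval call).
    return f"result({query})"


def filter_relevant(results: list[str], intent: str) -> list[str]:
    # Stage 5: relevance-based filtering (stubbed: keep results that
    # still mention the refined intent).
    return [r for r in results if intent in r]


def cue_m_pipeline(image: str, query: str) -> list[str]:
    context = enrich_image_context(image)
    intent = refine_intent(query, context)
    queries = generate_queries(intent)
    results = [retrieve(q) for q in queries]
    return filter_relevant(results, intent)


print(cue_m_pipeline("photo.jpg", "what flower is this"))
```

The sketch only fixes the stage ordering and data flow; each stage's real behavior is defined by the paper's components, not these stubs.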