🤖 AI Summary
To address key bottlenecks in multimodal RAG systems—including inaccurate user intent understanding, monolithic retrieval strategies, and weak filtering of inappropriate responses—this paper proposes an end-to-end, multi-stage framework. First, it introduces an image-context-enhanced intent refinement module to improve the semantic accuracy of queries. Second, it designs an intent-driven, context-aware query generation mechanism coupled with collaborative retrieval across heterogeneous APIs. Third, it pioneers a dynamic, organization-aware three-tier joint filtering mechanism—operating over images, text, and multimodal representations—to enable fine-grained relevance and safety control. The method integrates multimodal large language models (MLLMs), cross-modal classifiers, policy-aware dynamic filtering, and an API-integrated architecture. Extensive evaluation on multiple public benchmarks—including knowledge-intensive visual question answering (VQA) and safety-oriented datasets—as well as real-world data demonstrates consistent superiority over state-of-the-art methods, setting new records across several key metrics.
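The three-tier joint filtering described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the score functions, `Policy` structure, and thresholds are all assumed names standing in for the image-based, text-based, and multimodal classifiers and the organization-defined policies.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    # Per-category thresholds an organization might configure (assumed
    # structure; the paper's policies may be richer than three scalars).
    image_threshold: float
    text_threshold: float
    multimodal_threshold: float


def joint_filter(image_score: float, text_score: float,
                 multimodal_score: float, policy: Policy) -> bool:
    """Pass a candidate response only if it clears all three tiers."""
    return (image_score >= policy.image_threshold
            and text_score >= policy.text_threshold
            and multimodal_score >= policy.multimodal_threshold)


strict = Policy(image_threshold=0.8, text_threshold=0.7,
                multimodal_threshold=0.75)
print(joint_filter(0.9, 0.85, 0.80, strict))  # clears every tier
print(joint_filter(0.9, 0.50, 0.80, strict))  # rejected by the text tier
```

The point of the joint (conjunctive) design is that a response must satisfy every tier, so a policy can tighten any single modality's threshold independently without retraining the others.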
📝 Abstract
The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and public benchmarks on knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.
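The five pipeline stages the abstract names can be sketched as a chain of calls. Every function below is a placeholder stub (an assumption for illustration); in CUE-M each stage would be backed by an MLLM, external search APIs, or trained classifiers rather than these string manipulations.

```python
def enrich_image_context(image: str) -> str:
    # Stage 1: image context enrichment — derive a textual description
    # of the image (stubbed here).
    return f"context({image})"


def refine_intent(query: str, context: str) -> str:
    # Stage 2: intent refinement using the enriched image context.
    return f"intent({query}|{context})"


def generate_queries(intent: str) -> list[str]:
    # Stage 3: contextual query generation (stubbed: two variants).
    return [f"query-{i}:{intent}" for i in range(2)]


def retrieve(query: str) -> str:
    # Stage 4: external API integration (stubbed retrieval call).
    return f"result({query})"


def filter_relevant(results: list[str], intent: str) -> list[str]:
    # Stage 5: relevance-based filtering (stubbed: keep results that
    # still mention the refined intent).
    return [r for r in results if intent in r]


def cue_m_pipeline(image: str, query: str) -> list[str]:
    context = enrich_image_context(image)
    intent = refine_intent(query, context)
    queries = generate_queries(intent)
    results = [retrieve(q) for q in queries]
    return filter_relevant(results, intent)


print(cue_m_pipeline("photo.jpg", "what flower is this"))
```

The sketch only fixes the stage ordering and data flow; each stage's real behavior is defined by the paper's components, not these stubs.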