MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering

📅 2024-08-16

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal question-answering systems struggle to generate comprehensive multimodal answers (incorporating text, images, and video) that support conceptual explanations and step-by-step tutorials, limiting their applicability in enterprise customer service and educational settings. To address this, we propose MuRAR, the first lightweight, end-to-end multimodal answer refinement framework that adapts existing chatbots without model retraining. Methodologically, MuRAR integrates cross-modal retrieval (text–image/video), multi-granularity answer alignment, LLM-driven modality-aware rewriting, and consistency optimization to jointly leverage and reconstruct responses from heterogeneous multimodal data. Human evaluation demonstrates that MuRAR’s outputs significantly outperform text-only baselines in both usefulness and readability, while maintaining low deployment overhead and strong scalability across diverse domains and modalities.

📝 Abstract

Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots and settings such as customer service and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This framework can be easily extended to support multimodal answers in enterprise chatbots with minimal modifications. Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable compared to plain text answers.
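
The retrieve-then-refine idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `MediaItem` type, the `retrieve` and `refine` functions, the Jaccard word-overlap score, and the 0.2 threshold are all hypothetical stand-ins chosen for illustration. A real system would presumably use a learned retriever and an LLM for the refinement step.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class MediaItem:
    """Hypothetical multimodal asset, matched via its surrounding text."""
    kind: str     # e.g. "image" or "video"
    url: str
    context: str  # caption or nearby text used for matching


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets (assumption: replaces a learned retriever)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(snippet: str, items: list[MediaItem], threshold: float) -> MediaItem | None:
    """Return the best-matching item for a text snippet, or None if below threshold."""
    best = max(items, key=lambda it: similarity(snippet, it.context), default=None)
    if best is not None and similarity(snippet, best.context) >= threshold:
        return best
    return None


def refine(answer: str, items: list[MediaItem], threshold: float = 0.2) -> str:
    """Insert a placeholder for each retrieved item after the sentence it matches.

    Assumption: simple placeholder insertion stands in for LLM-based refinement.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    used: set[str] = set()
    out: list[str] = []
    for sentence in sentences:
        out.append(sentence)
        candidates = [it for it in items if it.url not in used]
        item = retrieve(sentence, candidates, threshold)
        if item is not None:
            used.add(item.url)  # attach each asset at most once
            out.append(f"[{item.kind}: {item.url}]")
    return "\n".join(out)
```

Called with a text answer and a list of candidate assets, `refine` returns the original sentences with a placeholder line after each sentence that has a sufficiently similar asset. Sentences with no match above the threshold are left as plain text.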

Problem

Research questions and friction points this paper is trying to address.

Enhances text-based answers with multimodal data.

Generates comprehensive multimodal answers for complex queries.

Improves enterprise chatbots with coherent multimodal responses.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Retrieval and Refinement

Enhances Text-based Answers

Supports Enterprise Chatbots

🔎 Similar Papers

Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models