🤖 AI Summary
To address the limited generalization capability of conventional text-based question answering (QA) systems amid the explosive growth of multimedia content, this paper presents a systematic survey of retrieval-augmented multimodal question answering (MMQA). We propose the first unified taxonomy for MMQA encompassing images, text, audio, video, and structured metadata, centered on three core challenges: cross-modal alignment, semantic grounding, and low-latency, high-accuracy joint inference. Our framework integrates key techniques—including multimodal representation learning, cross-modal attention, retrieval-augmented generation (RAG), and neuro-symbolic reasoning—and conducts a comprehensive empirical comparison across mainstream benchmarks. Results reveal fundamental trade-offs among accuracy, efficiency, and generalization in existing approaches. We identify context-aware enhancement and lightweight inference as two critical future directions. This work provides both theoretical foundations and practical guidelines for developing robust, scalable MMQA systems.
📝 Abstract
Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advances in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance trade-offs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy trade-offs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
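The retrieve-then-generate loop the abstract describes can be sketched minimally as follows. This is an illustrative toy, not any surveyed system: the corpus items, modality tags, and hand-made 3-d embeddings are all hypothetical placeholders, and the "generation" step is a stub that merely grounds the answer in the retrieved cross-modal evidence.

```python
import math

# Hypothetical multimodal corpus: each item pairs a modality tag with a
# pre-computed embedding (hand-made 3-d vectors purely for illustration;
# a real system would use a learned joint embedding space).
CORPUS = [
    {"modality": "image", "caption": "a red bus on a city street", "vec": [0.9, 0.1, 0.0]},
    {"modality": "audio", "caption": "street traffic noise",       "vec": [0.7, 0.2, 0.1]},
    {"modality": "text",  "caption": "bus timetable for route 42", "vec": [0.2, 0.9, 0.1]},
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=2):
    """Rank corpus items by similarity to the query embedding, keep top-k."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]

def answer(query_vec, corpus):
    """Stub generation step: fuse retrieved evidence into a grounded answer."""
    evidence = retrieve(query_vec, corpus)
    return " / ".join(f"[{d['modality']}] {d['caption']}" for d in evidence)

# A query embedding close to the visual/acoustic street scene.
print(answer([0.8, 0.15, 0.05], CORPUS))
```

Real pipelines replace the stub with a generator conditioned on the retrieved items, which is where the latency-accuracy trade-offs highlighted in the survey arise: larger k improves grounding but increases retrieval and generation cost.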