MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multi-source, multimodal question answering (QA) faces challenges in dynamically selecting and fusing optimal answer sources across heterogeneous modalities (e.g., text, images). Method: We propose a question-guided dense-sparse hybrid Mixture-of-Experts (MoE) architecture comprising: (i) a question-guided cross-source attention mechanism and intra-source explicit alignment module for fine-grained cross-modal relevance modeling; (ii) a scalable sparse MoE router enabling adaptive decision-making over hundreds of QA types; and (iii) a T5/Flan-T5–based improved joint decoding framework. Contribution/Results: Our method achieves significant improvements over state-of-the-art on three multi-source multimodal QA benchmarks. Ablation studies confirm the efficacy of each component. The model demonstrates strong robustness, cross-modal generalization, and high scalability, establishing a novel paradigm for joint reasoning across heterogeneous multimodal sources.

Technology Category

Application Category

📝 Abstract
Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model's efficacy, supported by ablation studies.
Problem

Research questions and friction points this paper is trying to address.

Handles multi-source, multi-modal information for QA and VQA.
Selects optimal answer sources using question-guided attention.
Scales to diverse questions via sparse mixture-of-experts framework.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Question-guided attention mechanism for multi-source data
Explicit alignment between questions and information sources
Sparse mixture-of-experts framework for scalability
🔎 Similar Papers
No similar papers found.