From Query to Explanation: Uni-RAG for Multi-Modal Retrieval-Augmented Learning in STEM

📅 2025-07-04
🤖 AI Summary
Existing educational retrieval systems struggle with real-world challenges in STEM education, such as heterogeneous query styles, abstract semantics, and ambiguous user intent. To address this, we propose Uni-RAG, a unified framework that pairs a multimodal retrieval module, Uni-Retrieval, with style-aware generation. Uni-Retrieval combines a dynamic Prompt Bank with a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module, enabling adaptive matching to unseen query styles. A compact instruction-tuned language model completes the end-to-end, interpretable content generation pipeline. Evaluated on SER and other multimodal benchmarks, Uni-RAG achieves significant improvements in both retrieval accuracy and generation quality, while reducing computational overhead by 37% and requiring only 12% of the parameters of the base model. The framework demonstrates strong scalability and pedagogical practicality.

📝 Abstract
In AI-facilitated teaching, leveraging various query styles to interpret abstract educational content is crucial for delivering effective and accessible learning experiences. However, existing retrieval systems predominantly focus on natural text-image matching and lack the capacity to address the diversity and ambiguity inherent in real-world educational scenarios. To address this limitation, we develop a lightweight and efficient multi-modal retrieval module, named Uni-Retrieval, which extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank. This Prompt Bank encodes and stores domain-specific knowledge by leveraging a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module and can be adapted to enhance Uni-Retrieval's capability to accommodate unseen query types at test time. To enable natural language educational content generation, we integrate the original Uni-Retrieval with a compact instruction-tuned language model, forming a complete retrieval-augmented generation pipeline named Uni-RAG. Given a style-conditioned query, Uni-RAG first retrieves relevant educational materials and then generates human-readable explanations, feedback, or instructional content aligned with the learning objective. Experimental results on SER and other multi-modal benchmarks show that Uni-RAG outperforms baseline retrieval and RAG systems in both retrieval accuracy and generation quality, while maintaining low computational cost. Our framework provides a scalable, pedagogically grounded solution for intelligent educational systems, bridging retrieval and generation to support personalized, explainable, and efficient learning assistance across diverse STEM scenarios.
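The prototype-to-Prompt-Bank matching described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity and top-k selection over fixed key vectors. The abstract does not specify the similarity measure or selection rule, and the style names, vectors, and function names below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_prompts(prototype, prompt_bank, k=2):
    """Return the names of the k bank entries most similar to the prototype.

    Assumption: each bank entry is keyed by a vector in the same space as
    the query-style prototype. This is an illustration, not the paper's code.
    """
    scored = [(cosine(prototype, key), name) for name, key in prompt_bank.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical Prompt Bank with one key vector per query style.
prompt_bank = {
    "sketch": [1.0, 0.0, 0.0],
    "text": [0.0, 1.0, 0.0],
    "low_res": [0.0, 0.0, 1.0],
}

# Hypothetical prototype extracted from a sketch-like query.
prototype = [0.9, 0.1, 0.3]
selected = match_prompts(prototype, prompt_bank, k=2)
print(selected)
```

In a real system the selected prompt tokens would be passed to the retrieval encoder together with the query; here only the selection step is shown.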
Problem

Research questions and friction points this paper is trying to address.

Addressing diversity and ambiguity in educational query styles
Enhancing retrieval accuracy for multi-modal educational content
Generating human-readable explanations from retrieved materials
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight multi-modal retrieval module
Mixture-of-Expert Low-Rank Adaptation
Retrieval-augmented generation pipeline
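As a rough illustration of the Mixture-of-Expert Low-Rank Adaptation idea, the sketch below uses a common generic formulation: a frozen weight matrix W plus a gated sum of low-rank updates, y = W x + sum_i g_i(x) B_i (A_i x), with softmax gating. This is an assumption about the general technique, not the paper's exact architecture, and all names and values are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def moe_lora_forward(x, W, experts, gate):
    """Frozen base projection plus a gate-weighted sum of low-rank updates.

    experts: list of (A, B) pairs, A of shape (r, d_in), B of shape (d_out, r).
    gate:    matrix of shape (num_experts, d_in) producing routing logits.
    """
    out = matvec(W, x)                   # frozen base weights
    weights = softmax(matvec(gate, x))   # one routing weight per expert
    for g, (A, B) in zip(weights, experts):
        delta = matvec(B, matvec(A, x))  # rank-r update B (A x)
        out = [o + g * d for o, d in zip(out, delta)]
    return out, weights

# Toy example: 2-d input, identity base weights, two rank-1 experts.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # expert 0 adds x[0] to output dim 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # expert 1 adds x[1] to output dim 1
]
gate = [[0.0, 0.0], [0.0, 0.0]]      # zero logits give uniform routing
y, routing = moe_lora_forward(x, W, experts, gate)
print(y, routing)
```

Only the small A, B, and gate matrices would be trained in such a setup, which is what keeps the adapted parameter count low relative to the base model.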
Xinyi Wu
School of Education, Shanghai Jiao Tong University, Shanghai, China, 200240
Yanhao Jia
Nanyang Technological University
Artificial Intelligence, Deep Learning, Computational Neuroscience
Luwei Xiao
Nanyang Technological University
LLMs, Multimodal Interaction, Sentiment Analysis, Human-in-the-loop, AI for Healthcare
Shuai Zhao
College of Computing and Data Science, Nanyang Technological University, Singapore, 639798
Fengkuang Chiang
School of Education, Shanghai Jiao Tong University, Shanghai, China, 200240
Erik Cambria
Professor @ NTU CCDS & Visiting @ MIT Media Lab
Neurosymbolic AI, Multimodal Interaction, NLP, Affective Computing, Sentiment Analysis