Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Constructing high-quality, scalable multimodal scientific question-answering (MMQA) benchmarks is costly and labor-intensive. Method: This paper proposes a TQA-to-MMQA transformation framework, an automated pipeline that converts text-only scientific QA pairs (TQAs) into high-fidelity MMQAs. It integrates large language model agents, multimodal content generation (e.g., figures, tables, equations), multi-dimensional quality assessment, and human judgment alignment within a closed-loop iterative refinement system. Contributions/Results: (1) Two benchmarks that evaluate models on MMQA generation and on MMQA quality evaluation; (2) An interpretable, human-aligned multi-dimensional quality rubric; (3) With the Q-Mirror agent, average MMQA quality scores increase from 78.90 to 85.22 and the pass rate rises from 72% to 95%, indicating that automated, large-scale construction of high-quality scientific multimodal benchmarks is feasible.


📝 Abstract
High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our work has three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We construct two extensive benchmarks to rigorously evaluate state-of-the-art generation and understanding models on the distinct tasks of MMQA generation and MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Transforming text-only QA pairs into multi-modal scientific benchmarks
Developing a framework for generating multi-modal QA pairs and evaluating their quality
Creating agentic system to iteratively refine multi-modal scientific questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework transforms text-only QA into multi-modal QA
Agentic system integrates generation and evaluation in a closed loop
Iterative refinement improves benchmark quality and pass rates
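
The closed generate-evaluate-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the Q-Mirror implementation: the names `MMQA`, `Review`, `refine_loop`, `toy_generate`, and `toy_evaluate` are hypothetical, and the 0-100 score scale, the pass threshold of 80, and the round limit of 3 are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class MMQA:
    """Hypothetical container for one multi-modal QA pair."""
    question: str
    answer: str
    image_spec: str  # stand-in for the generated visual content


@dataclass
class Review:
    """Hypothetical evaluator output."""
    score: float   # aggregated rubric score (0-100 scale is an assumption)
    feedback: str  # textual hints passed to the next generation round


def refine_loop(
    tqa: Tuple[str, str],
    generate: Callable[[Tuple[str, str], str], MMQA],
    evaluate: Callable[[MMQA], Review],
    pass_threshold: float = 80.0,  # assumed value, not taken from the paper
    max_rounds: int = 3,           # assumed value, not taken from the paper
) -> Tuple[MMQA, Review]:
    """Generate, score, and regenerate until a candidate passes or rounds run out."""
    feedback = ""
    best = None
    for _ in range(max_rounds):
        candidate = generate(tqa, feedback)
        review = evaluate(candidate)
        # Keep the highest-scoring candidate seen so far.
        if best is None or review.score > best[1].score:
            best = (candidate, review)
        if review.score >= pass_threshold:
            break
        feedback = review.feedback
    return best


def toy_generate(tqa: Tuple[str, str], feedback: str) -> MMQA:
    # Stub generator: any feedback yields a more complete figure description.
    spec = "diagram" + (" with labeled axes" if feedback else "")
    return MMQA(question=tqa[0], answer=tqa[1], image_spec=spec)


def toy_evaluate(mmqa: MMQA) -> Review:
    # Stub evaluator: rewards figures whose axes are labeled.
    if "labeled axes" in mmqa.image_spec:
        return Review(score=90.0, feedback="")
    return Review(score=70.0, feedback="label the axes")


result, review = refine_loop(("Q", "A"), toy_generate, toy_evaluate)
```

With these stubs, the first candidate fails the assumed threshold, the evaluator's feedback is fed back into generation, and the second candidate passes. In a real system the two callables would wrap a generation model and an understanding model.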
Junying Wang
PhD Student at Shanghai AI Lab & Fudan University
LMM benchmark, AIGC, AI Safety
Zicheng Zhang
Shanghai Artificial Intelligence Laboratory
Ye Shen
Baylor College of Medicine
Yalun Wu
Shanghai Jiao Tong University
Yingji Liang
Shanghai Artificial Intelligence Laboratory
Yijin Guo
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Farong Wen
Student, Shanghai Jiao Tong University | Shanghai AI Laboratory
Intelligent Digital Human, Large Language Model, AI Evaluation
Wenzhe Li
Princeton University
Xuezhi Zhao
Shanghai Artificial Intelligence Laboratory
Qi Jia
Shanghai Artificial Intelligence Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays