Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Constructing high-quality, scalable multimodal scientific question-answering (MMQA) benchmarks is costly and labor-intensive. Method: This paper proposes the TQA-to-MMQA transformation framework, an automated pipeline that converts text-only scientific QA pairs (TQAs) into high-fidelity MMQAs. It integrates large language model agents, multimodal content generation (e.g., figures, tables, equations), multi-dimensional quality assessment, and human-judgment alignment within a closed-loop iterative optimization system. Contributions/Results: (1) the first domain-specific multimodal scientific QA benchmark covering both generation and evaluation; (2) an interpretable, human-aligned, multi-dimensional quality evaluation framework; (3) empirical results showing that average MMQA quality scores increase from 78.90 to 85.22 and the pass rate rises from 72% to 95%, demonstrating the feasibility and effectiveness of automated, large-scale construction of high-quality scientific multimodal benchmarks.
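
To make the closed loop concrete, below is a minimal control-flow sketch in Python. The generate_mmqa, evaluate_mmqa, and refine_mmqa helpers are hypothetical stand-ins for the LLM- and LMM-backed agents the summary describes, and the rubric dimensions, 85-point threshold, and iteration budget are illustrative assumptions, not the paper's exact configuration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the TQA-to-MMQA closed loop described above.
# All helpers are placeholders; in the actual system they would be
# backed by generation and understanding models.

@dataclass
class MMQA:
    question: str
    answer: str
    figures: list = field(default_factory=list)   # generated visual assets
    feedback: list = field(default_factory=list)  # evaluator critiques so far

def generate_mmqa(tqa: dict) -> MMQA:
    """Placeholder: an LLM agent would draft a multi-modal version of a text-only QA."""
    return MMQA(question=tqa["question"], answer=tqa["answer"])

def evaluate_mmqa(mmqa: MMQA) -> dict:
    """Placeholder: an understanding model would score each rubric dimension (0-100)."""
    return {"fidelity": 80.0, "visual_quality": 78.0, "alignment": 79.0}

def refine_mmqa(mmqa: MMQA, scores: dict) -> MMQA:
    """Placeholder: the agent would revise the weakest dimension flagged by the evaluator."""
    mmqa.feedback.append(min(scores, key=scores.get))
    return mmqa

def q_mirror_loop(tqa: dict, pass_threshold: float = 85.0, max_iters: int = 3) -> MMQA:
    """Generate once, then alternate evaluation and refinement until the rubric average passes."""
    mmqa = generate_mmqa(tqa)
    for _ in range(max_iters):
        scores = evaluate_mmqa(mmqa)
        if sum(scores.values()) / len(scores) >= pass_threshold:
            break  # quality gate cleared; accept this MMQA
        mmqa = refine_mmqa(mmqa, scores)
    return mmqa

result = q_mirror_loop({"question": "What is the escape velocity of Earth?",
                        "answer": "About 11.2 km/s"})
```

The real agent presumably acts on the evaluator's per-dimension feedback rather than a bare average, but the control flow, generate, then evaluate and refine until a quality gate passes, is the loop the summary describes.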

πŸ“ Abstract
High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our exploration comprises three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We then construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror) that operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
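
As one way to picture the multi-dimensional quality rubric the abstract introduces, the sketch below aggregates per-dimension scores into an overall score and a pass decision. The dimension names, weights, and 80-point cutoff are assumptions made for illustration, not the paper's published rubric.

```python
# Illustrative aggregation of a multi-dimensional quality rubric into one
# score and a pass/fail decision. Dimension names, weights, and the cutoff
# are assumptions for this sketch, not the paper's published rubric.

RUBRIC_WEIGHTS = {
    "scientific_correctness": 0.3,
    "text_image_consistency": 0.3,
    "visual_clarity": 0.2,
    "answer_uniqueness": 0.2,
}

def aggregate(scores: dict[str, float], cutoff: float = 80.0) -> tuple[float, bool]:
    """Weighted average over rubric dimensions, plus a pass decision."""
    total = sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS)
    return total, total >= cutoff

# Example: a candidate MMQA scored by an understanding model.
score, passed = aggregate({
    "scientific_correctness": 90,
    "text_image_consistency": 85,
    "visual_clarity": 78,
    "answer_uniqueness": 82,
})
print(f"overall={score:.2f} pass={passed}")  # overall=84.50 pass=True
```

An interpretable per-dimension breakdown like this is also what allows the refinement agent to target the weakest aspect of a candidate MMQA rather than regenerating it wholesale.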
Problem

Research questions and friction points this paper is trying to address.

Transforming text-only QA pairs into multi-modal scientific benchmarks
Developing a framework for generating multi-modal QA pairs and evaluating their quality
Creating an agentic system that iteratively refines multi-modal scientific questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

A framework that transforms text-only QA pairs into multi-modal QA pairs
An agentic system that integrates generation and evaluation in a closed loop
Iterative refinement that improves benchmark quality and pass rates
🔎 Similar Papers
No similar papers found.
Junying Wang
PhD Student at Shanghai AI Lab & Fudan University
LMM benchmark · AIGC · AI Safety
Zicheng Zhang
Shanghai Artificial Intelligence Laboratory
Ye Shen
Baylor College of Medicine
Yalun Wu
Shanghai Jiao Tong University
Yingji Liang
Shanghai Artificial Intelligence Laboratory
Yijin Guo
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Farong Wen
Student, Shanghai Jiao Tong University | Shanghai AI Laboratory
Intelligent Digital Human · Large Language Model · AI Evaluation
Wenzhe Li
Princeton University
Xuezhi Zhao
Shanghai Artificial Intelligence Laboratory
Qi Jia
Shanghai Artificial Intelligence Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing · Visual Quality Assessment · QoE · AI Evaluation · Displays