AI Summary
Constructing high-quality, scalable multimodal scientific question-answering (MMQA) benchmarks is costly and labor-intensive. Method: This paper proposes the TQA-to-MMQA transformation framework, an automated pipeline that converts text-only scientific QA pairs (TQAs) into high-fidelity MMQAs. It integrates large language model agents, multimodal content generation (e.g., figures, tables, and equations), multi-dimensional quality assessment, and human-judgment alignment within a closed-loop iterative optimization system. Contributions/Results: (1) the first domain-specific multimodal scientific QA benchmarks covering both MMQA generation and MMQA quality evaluation; (2) an interpretable, human-aligned multi-dimensional quality evaluation framework; (3) empirical results showing that the average MMQA quality score increases from 78.90 to 85.22 and the pass rate rises from 72% to 95%, demonstrating the feasibility and effectiveness of automated, large-scale construction of high-quality scientific multimodal benchmarks.
Abstract
High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential of transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our work comprises three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We then construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
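To make the closed-loop design concrete, the following is a minimal sketch, not the authors' implementation, of how a Q-Mirror-style generate-evaluate-refine loop could be wired together. The rubric dimensions, pass threshold, iteration budget, and the helper functions `generate_mmqa` and `evaluate_mmqa` are illustrative placeholders standing in for the actual generation and understanding models, not components described in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical rubric dimensions; the paper's actual rubric may differ.
RUBRIC_DIMENSIONS = ["faithfulness", "visual_necessity", "clarity", "answerability"]
PASS_THRESHOLD = 85.0   # assumed passing score on a 0-100 scale
MAX_ITERATIONS = 3      # assumed refinement budget


@dataclass
class MMQA:
    """A candidate multi-modal QA pair produced from a text-only QA pair."""
    question: str
    answer: str
    figure_spec: str                      # e.g., plotting script, table, or equation source
    feedback: list[str] = field(default_factory=list)


def generate_mmqa(tqa: dict, feedback: list[str]) -> MMQA:
    """Stand-in for the generation model: turn a TQA into an MMQA,
    conditioning on any evaluator feedback from earlier rounds."""
    note = f" [revised per: {'; '.join(feedback)}]" if feedback else ""
    return MMQA(
        question=tqa["question"] + " (see the accompanying figure)",
        answer=tqa["answer"],
        figure_spec=f"figure derived from: {tqa['question']}{note}",
    )


def evaluate_mmqa(mmqa: MMQA) -> tuple[dict[str, float], list[str]]:
    """Stand-in for the understanding model acting as judge: score each
    rubric dimension (0-100) and return textual critiques for refinement."""
    scores = {dim: 80.0 for dim in RUBRIC_DIMENSIONS}   # placeholder scores
    critiques = ["make the figure essential to answering the question"]
    return scores, critiques


def closed_loop_refine(tqa: dict) -> tuple[MMQA, float]:
    """Generate, evaluate, and refine until the rubric average passes or the budget runs out."""
    feedback: list[str] = []
    best, best_score = None, float("-inf")
    for _ in range(MAX_ITERATIONS):
        candidate = generate_mmqa(tqa, feedback)
        scores, critiques = evaluate_mmqa(candidate)
        avg = sum(scores.values()) / len(scores)
        if avg > best_score:
            best, best_score = candidate, avg
        if avg >= PASS_THRESHOLD:
            break
        feedback = critiques   # feed critiques into the next generation round
    return best, best_score


if __name__ == "__main__":
    tqa = {"question": "What is the escape velocity from Earth's surface?",
           "answer": "About 11.2 km/s."}
    mmqa, score = closed_loop_refine(tqa)
    print(f"Final rubric average: {score:.1f}")
    print(mmqa.question)
```

The design choice illustrated here is that the evaluator's per-dimension critiques are fed back into the generator on the next round, which is what turns separate generation and evaluation stages into a single refinement loop; the reported score and pass-rate gains are attributed to this coupling rather than to either component alone.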