AI Summary
Constructing high-quality, scalable multimodal scientific question-answering (MMQA) benchmarks is costly and labor-intensive. Method: This paper proposes the TQA-to-MMQA transformation framework, an automated pipeline that converts text-only scientific QA pairs (TQAs) into high-fidelity MMQAs. It integrates large language model agents, multimodal content generation (e.g., figures, tables, and equations), multi-dimensional quality assessment, and human-judgment alignment within a closed-loop iterative optimization system. Contributions/Results: (1) the first domain-specific multimodal scientific QA benchmarks covering both MMQA generation and MMQA quality evaluation; (2) an interpretable, human-aligned multi-dimensional quality evaluation framework; (3) empirical results showing that the average MMQA quality score increases from 78.90 to 85.22 and the pass rate rises from 72% to 95%, demonstrating the feasibility and effectiveness of automated, large-scale construction of high-quality scientific multimodal benchmarks.
Abstract
High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential of transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our work comprises three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We then construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
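To make the closed-loop design concrete, the following is a minimal sketch, not the authors' implementation, of how a Q-Mirror-style generate-evaluate-refine loop could be wired together. The rubric dimensions, pass threshold, iteration budget, and the helper functions `generate_mmqa` and `evaluate_mmqa` are illustrative placeholders standing in for the actual generation and understanding models, not components described in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical rubric dimensions; the paper's actual rubric may differ.
RUBRIC_DIMENSIONS = ["faithfulness", "visual_necessity", "clarity", "answerability"]
PASS_THRESHOLD = 85.0   # assumed passing score on a 0-100 scale
MAX_ITERATIONS = 3      # assumed refinement budget


@dataclass
class MMQA:
    """A candidate multi-modal QA pair produced from a text-only QA pair."""
    question: str
    answer: str
    figure_spec: str                      # e.g., plotting script, table, or equation source
    feedback: list[str] = field(default_factory=list)


def generate_mmqa(tqa: dict, feedback: list[str]) -> MMQA:
    """Stand-in for the generation model: turn a TQA into an MMQA,
    conditioning on any evaluator feedback from earlier rounds."""
    note = f" [revised per: {'; '.join(feedback)}]" if feedback else ""
    return MMQA(
        question=tqa["question"] + " (see the accompanying figure)",
        answer=tqa["answer"],
        figure_spec=f"figure derived from: {tqa['question']}{note}",
    )


def evaluate_mmqa(mmqa: MMQA) -> tuple[dict[str, float], list[str]]:
    """Stand-in for the understanding model acting as judge: score each
    rubric dimension (0-100) and return textual critiques for refinement."""
    scores = {dim: 80.0 for dim in RUBRIC_DIMENSIONS}   # placeholder scores
    critiques = ["make the figure essential to answering the question"]
    return scores, critiques


def closed_loop_refine(tqa: dict) -> tuple[MMQA, float]:
    """Generate, evaluate, and refine until the rubric average passes or the budget runs out."""
    feedback: list[str] = []
    best, best_score = None, float("-inf")
    for _ in range(MAX_ITERATIONS):
        candidate = generate_mmqa(tqa, feedback)
        scores, critiques = evaluate_mmqa(candidate)
        avg = sum(scores.values()) / len(scores)
        if avg > best_score:
            best, best_score = candidate, avg
        if avg >= PASS_THRESHOLD:
            break
        feedback = critiques   # feed critiques into the next generation round
    return best, best_score


if __name__ == "__main__":
    tqa = {"question": "What is the escape velocity from Earth's surface?",
           "answer": "About 11.2 km/s."}
    mmqa, score = closed_loop_refine(tqa)
    print(f"Final rubric average: {score:.1f}")
    print(mmqa.question)
```

The design choice illustrated here is that the evaluator's per-dimension critiques are fed back into the generator on the next round, which is what turns separate generation and evaluation stages into a single refinement loop; the reported score and pass-rate gains are attributed to this coupling rather than to either component alone.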