MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

📅 2025-07-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The open-source community lacks large-scale, high-quality, verifiable scientific reasoning datasets, which hinders progress in scientific AI. Method: The paper introduces two open datasets. TextbookReasoning contains 650k reasoning questions across seven scientific disciplines, with reference answers extracted from university-level textbooks. MegaScience is a 1.25-million-instance mixture of open-source datasets, in which the subset taken from each source dataset was chosen through systematic ablation studies of data selection methods. The authors also build an evaluation system covering fifteen benchmarks. Contribution/Results: The datasets yield better performance and training efficiency, with shorter responses, than existing open-source scientific datasets. Llama-3.1, Qwen2.5, and Qwen3 base models trained on MegaScience outperform the corresponding official instruction-tuned models on average, and the gains are greater for larger and stronger models, suggesting a scaling benefit. The data curation pipeline, evaluation system, datasets, and seven trained models are released to support reproducible research.

📝 Abstract

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
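The abstract describes choosing, for each public dataset, the subset produced by the best-performing data selection method and then combining those subsets. A minimal sketch of that loop follows. It is an illustration under stated assumptions, not the authors' pipeline: the names `build_mixture`, `selectors`, and `score` are hypothetical, and `score` stands in for the expensive step of training a model on a subset and evaluating it on the benchmarks.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = dict  # hypothetical record type, e.g. {"q": "...", "a": "..."}
Selector = Callable[[List[Example]], List[Example]]


def build_mixture(
    datasets: Dict[str, List[Example]],
    selectors: Dict[str, Selector],
    score: Callable[[List[Example]], float],
) -> Tuple[List[Example], Dict[str, Optional[str]]]:
    """For each dataset, keep the subset from the highest-scoring selector.

    `score` is an assumption: a stand-in for "train on this subset and
    measure benchmark performance". Ties keep the first selector tried.
    """
    mixture: List[Example] = []
    chosen: Dict[str, Optional[str]] = {}
    for name, examples in datasets.items():
        best_name: Optional[str] = None
        best_subset: List[Example] = []
        best_score = float("-inf")
        for sel_name, select in selectors.items():
            subset = select(examples)
            s = score(subset)
            if s > best_score:
                best_name, best_subset, best_score = sel_name, subset, s
        chosen[name] = best_name
        mixture.extend(best_subset)
    return mixture, chosen
```

In practice each call to `score` would be a full ablation run, so the number of selector and dataset combinations determines the compute cost of building the mixture.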

Problem

Research questions and friction points this paper is trying to address.

Lack of open large-scale scientific reasoning datasets

Need for high-quality verifiable scientific question-answer pairs

Absence of a comprehensive evaluation system for scientific reasoning benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

TextbookReasoning dataset with 650k scientific questions

MegaScience dataset with 1.25M high-quality instances

Comprehensive evaluation system across 15 benchmarks
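The abstract notes that the evaluation system relies on answer extraction strategies to score model responses accurately. The sketch below shows one plausible shape for such a step, assuming three common response formats (a `\boxed{...}` answer, an "answer is (B)" multiple-choice statement, and a trailing number). The function `extract_answer` and its patterns are hypothetical and are not taken from the released evaluation code.

```python
import re
from typing import Optional

# Assumed formats; real benchmarks may need additional patterns.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")  # does not handle nested braces
_CHOICE = re.compile(r"(?i:answer\s+is)\s*:?\s*\(?([A-J])\)?(?![A-Za-z])")
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Return the final answer found in a model response, or None.

    Strategies are tried from most to least explicit, and the last
    match of each kind is used, since final answers tend to come last.
    """
    boxed = _BOXED.findall(response)
    if boxed:
        return boxed[-1].strip()
    choices = _CHOICE.findall(response)
    if choices:
        return choices[-1]
    numbers = _NUMBER.findall(response)
    if numbers:
        return numbers[-1]
    return None
```

Ordering the strategies from explicit to loose keeps the numeric fallback from overriding a clearly marked answer, at the cost of missing formats the patterns do not anticipate.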

🔎 Similar Papers

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models