Improving Data and Reward Design for Scientific Reasoning in Large Language Models

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models in open-ended scientific reasoning, which stem from unreliable supervision signals and ambiguous evaluation criteria—challenges rooted in data construction and reward design during scientific post-training. To overcome these bottlenecks, the authors propose the Dr. SCI framework, which introduces a million-scale dataset spanning eight STEM domains and incorporates a novel fine-grained scoring rubric (SciRubric). The framework further integrates exploration-augmented supervised fine-tuning, dynamic-difficulty curriculum learning, and SciRubric-guided reinforcement learning. Evaluated on challenging scientific benchmarks, the Qwen3-4B-Base model trained under this framework achieves 63.2 on GPQA-Diamond and 32.4 on GPQA-General, substantially outperforming strong baselines such as o1-mini and GPT-4o, thereby demonstrating the efficacy and novelty of the proposed approach.

📝 Abstract
Solving open-ended science questions remains challenging for large language models, largely due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design used for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT → RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation combined with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-Diamond and 32.4 on GPQA-General, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
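The SciRubric-guided reward described above can be illustrated with a minimal sketch: a weighted combination of rubric criterion scores plus an explicit answer-correctness term. The criterion names, weights, and the 50/50 mixing coefficient here are hypothetical assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of a SciRubric-style reward. The paper combines
# rubric-based evaluation with explicit answer correctness; the exact
# aggregation is not specified here, so this uses a simple weighted mix.

def rubric_reward(answer_correct, criterion_scores, weights, correctness_weight=0.5):
    """Combine binary answer correctness with weighted rubric criteria.

    answer_correct: bool, whether the final answer matches the reference.
    criterion_scores: dict mapping criterion name -> score in [0, 1],
        e.g. as produced by an LLM grader applying the rubric.
    weights: dict mapping criterion name -> nonnegative weight.
    correctness_weight: fraction of the reward tied to final-answer
        correctness (assumed value, not from the paper).
    """
    total_w = sum(weights.values())
    # Weighted average of rubric criteria; missing criteria score 0.
    rubric = sum(weights[c] * criterion_scores.get(c, 0.0) for c in weights) / total_w
    return correctness_weight * float(answer_correct) + (1.0 - correctness_weight) * rubric

# Example: a correct final answer with partial rubric credit
# (criterion names are illustrative).
scores = {"reasoning_validity": 1.0, "uses_correct_principles": 0.5, "clarity": 1.0}
w = {"reasoning_validity": 2.0, "uses_correct_principles": 2.0, "clarity": 1.0}
reward = rubric_reward(True, scores, w)  # rubric part = 4/5, reward = 0.9
```

A dense, bounded reward of this shape is what makes RL on open-ended answers stable: unlike a binary pass/fail signal, partial rubric credit gives the policy a gradient even when the final answer is wrong.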
Problem

Research questions and friction points this paper is trying to address.

scientific reasoning
open-ended questions
reward design
data construction
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific reasoning
open-ended evaluation
rubric-guided reinforcement learning
dynamic difficulty curriculum
large-scale science dataset