WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in scientific reasoning, which stem from the scarcity of high-quality training data and the inherent complexity of open-ended scientific problems. To overcome these challenges, the authors introduce WildSci, a framework that automatically synthesizes multiple-choice science questions from peer-reviewed literature spanning nine disciplines and 26 subfields, yielding a large-scale, structured reasoning dataset. Because the multiple-choice format provides well-defined reward signals, the dataset supports scalable fine-tuning of LLMs with reinforcement learning. The approach improves model performance across multiple scientific reasoning benchmarks, demonstrating the efficacy of both the synthetic data-generation pipeline and the training strategy. The WildSci dataset is publicly released to support further research in LLM-based scientific reasoning.

📝 Abstract
Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at https://huggingface.co/datasets/JustinTX/WildSci.
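The key mechanism the abstract describes is that framing open-ended scientific questions as multiple choice makes the reward objective: an answer is either the gold option or it is not. A minimal sketch of such a reward function is below; this is illustrative only, not the authors' code, and the answer-letter extraction heuristic (`extract_choice`) is an assumption about how a model's free-form response might be scored.

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the last standalone A-D letter the model committed to, if any.
    (Illustrative heuristic, not the paper's actual parser.)"""
    matches = re.findall(r"\b([A-D])\b", response.strip())
    return matches[-1] if matches else None

def mcq_reward(response: str, gold: str) -> float:
    """Binary reward for RL fine-tuning on multiple-choice questions:
    1.0 if the extracted choice matches the gold letter, else 0.0."""
    return 1.0 if extract_choice(response) == gold else 0.0
```

A binary reward like this is what makes the training signal "well-defined" in the abstract's sense: no graded judgment of an open-ended answer is needed, so it scales to automatically synthesized data.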
Problem

Research questions and friction points this paper is trying to address.

scientific reasoning
large language models
dataset scarcity
open-ended scientific questions
domain complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific reasoning
automatically synthesized dataset
reinforcement learning
multiple-choice framing
domain-specific LLM training