🤖 AI Summary
AI research assistants struggle to generate research plans that satisfy all stated constraints and implicit requirements. Method: This paper proposes a training recipe that combines automated rubric extraction with self-graded reinforcement learning: (1) automatically extracting research goals and goal-specific grading rubrics from papers across several domains to build a scalable, diverse training corpus; (2) training the plan generator with reinforcement learning in which a frozen copy of the initial policy grades plans against the rubrics, creating a generator-verifier gap that enables improvement without external human supervision; and (3) extending the approach to research goals from medical papers and new arXiv preprints, evaluated by a jury of frontier models. Results: In a 225-hour study on machine learning research goals, human experts prefer plans from the finetuned Qwen3-30B-A3B over those from the initial model for 70% of goals and approve 84% of the automatically extracted rubrics. Finetuning yields 12–22% relative improvements with significant cross-domain generalization, including in settings such as medical research where execution feedback is infeasible.
📝 Abstract
AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a 225-hour study with human experts on machine learning research goals. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over those of the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
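The self-grading step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not specify how rubric judgments are aggregated, so the reward below is simply the fraction of rubric items the grader marks as satisfied, and `keyword_grader` is a toy stand-in for the frozen copy of the initial policy, which in the paper's setting would be a language model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RubricItem:
    """One goal-specific grading criterion (hypothetical structure)."""
    description: str


# A grader receives the research goal, a candidate plan, and one rubric item,
# and returns whether the plan satisfies that item. In the paper's setting this
# role is played by a frozen copy of the initial policy; the signature here is
# an assumption for illustration.
Grader = Callable[[str, str, RubricItem], bool]


def rubric_reward(goal: str, plan: str,
                  rubric: Sequence[RubricItem], grader: Grader) -> float:
    """Return the fraction of rubric items the grader marks as satisfied.

    Assumed aggregation rule: a plain average. An empty rubric yields 0.0.
    """
    if not rubric:
        return 0.0
    satisfied = sum(1 for item in rubric if grader(goal, plan, item))
    return satisfied / len(rubric)


def keyword_grader(goal: str, plan: str, item: RubricItem) -> bool:
    """Toy stand-in for the frozen grader: a case-insensitive substring check."""
    return item.description.lower() in plan.lower()


if __name__ == "__main__":
    goal = "Evaluate a new optimizer on small language models."
    rubric = [RubricItem("baseline"), RubricItem("ablation"), RubricItem("held-out")]
    plan = "Compare against a strong baseline and run an ablation over learning rates."
    print(rubric_reward(goal, plan, rubric, keyword_grader))
```

In a training loop, a scalar like this would serve as the reward for each sampled plan. Because the grader only has to check a plan against explicit rubric items rather than write a good plan itself, grading is an easier task than generation, which is the generator-verifier gap the abstract refers to.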