ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows

📅 2025-10-23

🤖 AI Summary

Problem: Large language models (LLMs) remain limited as end-to-end collaborators in scientific research, and evaluating them requires benchmarks that cover the whole workflow rather than isolated sub-tasks. Method: This work proposes the ResearchGPT vision and introduces CS-54k, a high-quality corpus of computer science Q&A pairs built from 14k CC-licensed papers. The corpus is constructed with a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control, and it is split into CS-4k, a carefully curated evaluation benchmark, and CS-50k, a large-scale training set used for supervised fine-tuning and reinforcement learning. Contribution/Results: CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k improve substantially, and even 7B-scale models outperform many larger proprietary systems, including GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that domain-aligned training on high-quality data matters more than pretraining scale or general benchmark performance.

📝 Abstract

As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for evaluating AI's ability to assist scientific research, and CS-50k, a large-scale training dataset. Extensive experiments demonstrate that CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k with supervised training and reinforcement learning demonstrate substantial improvements. Even 7B-scale models, when properly trained, outperform many larger proprietary systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that making AI models better research assistants relies more on domain-aligned training with high-quality data than on pretraining scale or general benchmark performance. We release CS-4k and CS-50k in the hope of fostering AI systems as reliable collaborators in CS research.

Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for end-to-end computer science research workflows

Creating high-quality training data for AI scientific research assistants

Evaluating AI's ability to assist throughout the entire scientific research process

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses retrieval-augmented generation for paper-grounded Q&A

Implements multi-stage quality control for factual grounding

Trains models with supervised and reinforcement learning methods
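
The summary does not describe how the quality-control stages are implemented, so the following is only a minimal sketch of one plausible stage: retrieve passages from the source paper for each question, then keep a Q&A pair only if most of its answer tokens appear in the retrieved text. The names `QAPair`, `retrieve`, `is_grounded`, `filter_pairs`, and the `min_overlap` threshold are all hypothetical, and the word-overlap retriever is a stand-in for whatever retriever the authors actually use.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    paper_id: str   # provenance: which paper the pair was generated from
    question: str
    answer: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages sharing the most tokens with the question.

    Assumption: a toy lexical retriever standing in for a real RAG retriever.
    """
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def is_grounded(pair: QAPair, passages: list[str], min_overlap: float = 0.6) -> bool:
    """Hypothetical QC stage: most answer tokens must occur in retrieved passages."""
    answer_tokens = tokenize(pair.answer)
    if not answer_tokens:
        return False
    context: set[str] = set()
    for passage in retrieve(pair.question, passages):
        context |= tokenize(passage)
    return len(answer_tokens & context) / len(answer_tokens) >= min_overlap


def filter_pairs(pairs: list[QAPair], corpus: dict[str, list[str]]) -> list[QAPair]:
    """Drop pairs whose paper is unknown or whose answer is not grounded."""
    return [
        p for p in pairs
        if p.paper_id in corpus and is_grounded(p, corpus[p.paper_id])
    ]
```

A real pipeline would likely chain several such checks (deduplication, answerability, model-based verification); this sketch shows only the grounding idea of tying every retained pair back to its source paper.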
