🤖 AI Summary
Rapid scientific knowledge evolution causes static evaluation benchmarks to become outdated, hindering accurate assessment of domain-specific language model capabilities.
Method: This paper introduces a scalable, modular multiple-choice question (MCQ) generation framework for radiology and cancer biology, leveraging distilled GPT-4.1 reasoning traces as retrieval sources—novel in scientific QA generation. Integrated with PDF parsing, semantic chunking, and automated question-answer synthesis, the framework implements a retrieval-augmented generation (RAG) pipeline.
Contribution/Results: It generates over 16,000 high-quality, expert-validated MCQs, rigorously evaluated via both expert annotation and synthetic benchmarking. Experiments demonstrate that reasoning-trace–driven retrieval substantially enhances performance of small-parameter models, with several outperforming GPT-4 on domain-specific professional examinations. This work establishes a new paradigm for dynamically updating scientific evaluation benchmarks and enabling efficient domain adaptation of lightweight language models.
📝 Abstract
As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.