Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Rapid growth of scientific knowledge causes static evaluation benchmarks to become outdated, hindering accurate assessment of domain-specific language model capabilities. Method: This paper introduces a scalable, modular multiple-choice question (MCQ) generation framework for radiation and cancer biology that leverages distilled GPT-4.1 reasoning traces as retrieval sources, a novel approach in scientific QA generation. The framework integrates PDF parsing, semantic chunking, and automated question-answer synthesis into a retrieval-augmented generation (RAG) pipeline. Contribution/Results: It generates over 16,000 high-quality MCQs, rigorously evaluated through both expert annotation and synthetic benchmarking. Experiments show that reasoning-trace-driven retrieval substantially improves the performance of small-parameter models, with several outperforming GPT-4 on domain-specific professional examinations. This work establishes a new paradigm for dynamically updating scientific evaluation benchmarks and for efficient domain adaptation of lightweight language models.

📝 Abstract
As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 ASTRO Radiation and Cancer Biology exam.
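The chunking stage of a pipeline like the one the abstract describes can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function name `semantic_chunks`, the `Chunk` dataclass, and the sentence-window parameters are all assumptions, and the paper's actual semantic chunker is presumably embedding-aware rather than purely sentence-based.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def semantic_chunks(doc_id: str, text: str,
                    max_sents: int = 3, overlap: int = 1) -> list[Chunk]:
    """Split parsed paper text into overlapping sentence windows.

    Overlap between consecutive windows keeps context that spans a
    sentence boundary retrievable from at least one chunk.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max_sents - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(sents), step):
        window = sents[start:start + max_sents]
        if window:
            chunks.append(Chunk(doc_id, " ".join(window)))
        if start + max_sents >= len(sents):
            break
    return chunks
```

Each chunk can then be fed to the question-generation model, and later indexed for RAG-time retrieval.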
Problem

Research questions and friction points this paper is trying to address.

Automating MCQA benchmark creation from scientific corpora
Evaluating small language models with retrieval-augmented generation
Improving domain adaptation using reasoning traces as retrieval sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for MCQA benchmark generation
Retrieval-augmented generation using reasoning traces
Semantic chunking from scientific paper corpora
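The retrieval mechanism behind the second bullet, using reasoning traces rather than raw paper chunks as the retrieval corpus, can be sketched as follows. This is a hedged sketch under assumptions: the paper almost certainly ranks with dense embeddings, whereas this version uses a simple bag-of-words cosine score so it stays self-contained; `retrieve_traces` and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_traces(question: str, traces: list[str], k: int = 2) -> list[str]:
    """Return the k distilled reasoning traces most similar to the question."""
    q = bow(question)
    return sorted(traces, key=lambda t: cosine(q, bow(t)), reverse=True)[:k]

def build_prompt(question: str, traces: list[str]) -> str:
    """Prepend retrieved reasoning traces to the MCQ for the small model."""
    ctx = "\n\n".join(retrieve_traces(question, traces))
    return f"Context (distilled reasoning):\n{ctx}\n\nQuestion: {question}\nAnswer:"
```

Swapping the lexical scorer for an embedding model would reproduce the dense-retrieval setup the framework implies; the prompt-assembly step stays the same.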