Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

📅 2026-05-12

🤖 AI Summary

Small language models are typically evaluated on multiple-choice question answering by direct answer prediction, which does not reflect how deployed systems rely on tools, code, and repeated model calls. This work proposes Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource that uses executable Python scaffolds to measure when such scaffolds improve performance. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows across six solver models, the non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a difference of 28.10 percentage points (pair-bootstrap interval [20.32, 36.43]). The authors present these estimates as descriptive: assisted inference uses a larger solver-call budget, answer extraction is brittle, and some generated programs violate the no-hard-coding instruction.

📝 Abstract

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
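The macro comparison and its bootstrap interval can be illustrated with a small sketch. This is not the authors' code. The `ResultRow` fields, the `group` key (for example one model and dataset pair), the gate rule (keep groups whose direct accuracy strictly exceeds the gate), and the choice to resample rows within each group are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultRow:
    """Hypothetical three-channel record; field names are assumptions."""
    group: str                 # assumed grouping key, e.g. "model/dataset"
    gold: str                  # reference answer label
    direct: Optional[str]      # answer from the direct solver prompt
    assisted: Optional[str]    # answer from the scaffold-assisted run
    generator: Optional[str]   # generator-side answer, kept for auditing


def group_rows(rows):
    """Bucket rows by their group key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.group].append(row)
    return dict(groups)


def accuracy(rows, channel):
    """Fraction of rows whose answer on `channel` matches the gold label."""
    return sum(getattr(r, channel) == r.gold for r in rows) / len(rows)


def macro_difference(groups, gate=0.0):
    """Macro assisted-minus-direct accuracy in percentage points.

    Only groups whose direct accuracy is strictly above `gate` are kept
    (gate=0.0 mimics a non-zero-baseline partition, gate=0.30 a stricter one).
    Returns None when no group passes the gate.
    """
    kept = [rows for rows in groups.values() if accuracy(rows, "direct") > gate]
    if not kept:
        return None
    direct = sum(accuracy(rows, "direct") for rows in kept) / len(kept)
    assisted = sum(accuracy(rows, "assisted") for rows in kept) / len(kept)
    return 100.0 * (assisted - direct)


def paired_bootstrap(groups, gate=0.0, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval for the macro difference.

    The partition is fixed on the observed data. Rows are resampled with
    replacement inside each kept group, so each draw keeps a row's direct
    and assisted outcomes together (the pairing).
    """
    rng = random.Random(seed)
    kept = {g: rows for g, rows in groups.items()
            if accuracy(rows, "direct") > gate}
    if not kept:
        return None
    diffs = []
    for _ in range(n_resamples):
        resampled = {g: rng.choices(rows, k=len(rows)) for g, rows in kept.items()}
        # gate=-1.0 keeps every resampled group, since accuracy is never negative
        diffs.append(macro_difference(resampled, gate=-1.0))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Because assisted runs use a larger solver-call budget, an interval computed this way describes the observed difference and does not by itself separate the scaffold's contribution from the extra calls.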

Problem

Research questions and friction points this paper is trying to address.

small language models

multiple-choice QA

reasoning scaffolds

executable code

model evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Code-Guided Reasoning

Small Language Models

Executable Scaffolds

MCQA

Program-aided Inference
