AI Summary
This study investigates whether instruction tuning genuinely enhances the reasoning capabilities of large language models or merely reinforces superficial pattern matching. Through systematic evaluation using zero-shot and few-shot chain-of-thought (CoT) prompting on standard mathematical benchmarks (e.g., GSM8K), structurally perturbed variants, and out-of-domain tasks (e.g., MedCalc), the authors compare base models against their instruction-tuned counterparts. The findings reveal that base models significantly outperform instruction-tuned models under zero-shot CoT, by as much as 32.6 percentage points for Llama3-70B, while the latter only match performance in few-shot settings. Moreover, instruction-tuned models exhibit markedly weaker robustness under domain shifts and input perturbations. This work is the first to demonstrate that the purported advantages of instruction tuning are highly sensitive to prompting strategies and can be reversed in domain-shifted scenarios such as MedCalc, where base models surpass their tuned variants.
Abstract
Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed base-model performance when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Additionally, instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure over robust reasoning.