AI Summary
This study investigates whether instruction tuning genuinely enhances the reasoning capabilities of large language models or merely reinforces superficial pattern matching. Through systematic evaluation using zero-shot and few-shot chain-of-thought (CoT) prompting on standard mathematical benchmarks (e.g., GSM8K), structurally perturbed variants, and out-of-domain tasks (e.g., MedCalc), the authors compare base models against their instruction-tuned counterparts. The findings reveal that base models significantly outperform instruction-tuned models under zero-shot CoT, by as much as 32.6 percentage points for Llama3-70B, while the latter only match performance in few-shot settings. Moreover, instruction-tuned models exhibit markedly weaker robustness under domain shifts and input perturbations. This work is the first to demonstrate that the purported advantages of instruction tuning are highly sensitive to prompting strategies and can be reversed in domain-shifted scenarios such as MedCalc, where base models surpass their tuned variants.
Abstract
Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed base-model performance when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Additionally, instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure over robust reasoning.