🤖 AI Summary
This work challenges the prevailing assumption that performance gains of large language models (LLMs) on GSM8k stem from intrinsic improvements in mathematical reasoning capability, arguing instead that broader pretraining data coverage is the primary driver. To address the resulting generalization bottleneck—particularly for models trained on less data or with weaker training—the authors introduce discourse structure as a novel, lightweight, plug-and-play supervisory signal for mathematical reasoning. Methodologically, they construct structured prompts grounded in discourse analysis, integrate them with instruction tuning, and explicitly annotate reasoning paths. Evaluated on open-source models including Llama2-13b, the approach improves GSM8k accuracy by up to 160%. It also substantially enhances out-of-distribution (OOD) robustness and yields consistent gains even on models that have likely memorized the benchmark. The core contribution is a minimally invasive reasoning-augmentation paradigm grounded in discourse structure—an effective, modular, and broadly applicable alternative to conventional reasoning supervision.
📝 Abstract
We look at reasoning on GSM8k, a dataset of short texts presenting primary-school math problems. We find, in line with Mirzadeh et al. (2024), that current LLM progress on the dataset may be explained not by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source that helps models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the dataset, adding discourse-structural information still improves predictions and dramatically improves large-model performance on out-of-distribution examples.