🤖 AI Summary
Large language models (LLMs) exhibit high sensitivity to minor input perturbations—such as prompt ordering, phrasing, or language switching—leading to inconsistent mathematical reasoning outputs, particularly in smaller and medium-sized models. To address this, we propose a Multidimensional Reasoning Consistency (MRC) framework that, for the first time, formalizes reasoning stability across orthogonal dimensions (ordering, phrasing, and language) as an optimizable robustness signal. Our method uses prompt engineering to generate perturbed samples along these dimensions, then aggregates predictions via consistency voting, evaluated in both zero-shot and few-shot settings. We conduct a systematic evaluation on the monolingual GSM8K and multilingual MGSM benchmarks. Experiments demonstrate that MRC significantly improves both the reasoning stability and the accuracy of open-source small-to-medium LLMs, yielding average accuracy gains of 3.2–5.7 percentage points on GSM8K and MGSM. This work establishes a novel paradigm for robust, lightweight mathematical reasoning.
📝 Abstract
While large language models (LLMs) have proven capable of addressing some complex reasoning tasks, they are also highly sensitive to input variation, which can lead to different solution paths and final answers. Answer consistency across input variations can thus be taken as a sign of stronger confidence. Leveraging this insight, we introduce *Multidimensional Reasoning Consistency*, a framework in which, focusing on math problems, models are systematically pushed to diversify solution paths toward a final answer, thereby testing them for answer consistency across multiple input variations. We induce variations in (i) the order of shots in the prompt, (ii) problem phrasing, and (iii) the language used. Extensive experiments on a wide range of open-source state-of-the-art LLMs of various sizes show that reasoning consistency differs by variation dimension, and that by aggregating consistency across dimensions, our framework consistently enhances mathematical reasoning performance on both the monolingual GSM8K dataset and the multilingual MGSM dataset, especially for smaller models.
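The core mechanics described above—perturbing the prompt along a dimension and aggregating answers by consistency—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`shot_order_variants`, `consistency_vote`) are hypothetical, only the shot-ordering dimension is shown, and the paper's actual perturbation and aggregation details may differ.

```python
import random
from collections import Counter


def shot_order_variants(shots: list[str], n: int, seed: int = 0) -> list[str]:
    """Build n prompt prefixes that differ only in the order of the
    few-shot examples (one perturbation dimension from the abstract)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        shuffled = shots[:]
        rng.shuffle(shuffled)
        variants.append("\n\n".join(shuffled))
    return variants


def consistency_vote(answers: list[str]) -> str:
    """Aggregate final answers from the perturbed prompts by majority
    vote; ties are broken by first occurrence."""
    return Counter(answers).most_common(1)[0][0]
```

In practice, each variant (and analogous rephrased or translated variants) would be sent to the model, the final numeric answers extracted, and `consistency_vote` applied to the pooled answers across all dimensions.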