🤖 AI Summary
Large language models (LLMs) exhibit significant fragility to structural variations in logical reasoning tasks, undermining their reliability and generalizability. Method: We construct three robustness benchmark datasets, ReClor-plus, LogiQA-plus, and LogiQAv2-plus, which apply systematic perturbations to standard logical reasoning benchmarks: shuffling the answer options, substituting the correct answer with "none of the other options is correct", and combining both, to rigorously assess model stability under structural change. We further propose (i) logic-driven data augmentation, (ii) task-structure-aware fine-tuning, and (iii) a unified structured prompting framework compatible with both discriminative and generative logical reasoning tasks. Contribution/Results: Our analysis reveals a consistent 20–40% performance drop across mainstream LLMs (e.g., GPT-4, LLaMA) under structural perturbations, a previously undocumented vulnerability. The proposed methods yield an average accuracy improvement of 12.7% on both the original and perturbed test sets, substantially enhancing cross-scenario generalization and robustness.
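The paper's unified structured prompting framework is not reproduced here; as a purely illustrative sketch, a structured prompt for a multiple-choice logical reasoning item might separate context, question, and options into labelled fields so that one template serves both discriminative (pick a letter) and generative (write out the answer) evaluation. The function name and template below are hypothetical, not the paper's actual prompt.

```python
def build_structured_prompt(context, question, options, generative=False):
    # Hypothetical template (assumption, not the paper's exact prompt):
    # labelled fields make the task structure explicit to the model.
    option_block = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    task = (
        "Write out the correct answer."  # generative variant
        if generative
        else "Answer with the letter of the correct option."  # discriminative variant
    )
    return (
        "Context:\n" + context + "\n\n"
        "Question:\n" + question + "\n\n"
        "Options:\n" + option_block + "\n\n"
        + task
    )
```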
📝 Abstract
Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning have not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of LLMs' reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve the model's performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
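The three perturbed subsets are mechanical to reproduce from any multiple-choice item. Below is a minimal sketch of how they might be generated, assuming each item is stored as a list of option strings plus a gold answer index; the function names and data representation are illustrative, not the repository's actual code.

```python
import random

NONE_OPTION = "none of the other options is correct"

def shuffle_options(options, answer_idx):
    """Subset 1: randomly reorder the options, tracking where the gold answer lands."""
    order = list(range(len(options)))
    random.shuffle(order)
    return [options[i] for i in order], order.index(answer_idx)

def substitute_answer(options, answer_idx):
    """Subset 2: replace the correct option's text with the 'none of the above'-style
    choice, which then becomes the gold answer."""
    perturbed = list(options)
    perturbed[answer_idx] = NONE_OPTION
    return perturbed, answer_idx

def shuffle_and_substitute(options, answer_idx):
    """Subset 3: apply both perturbations (shuffle first, then substitute)."""
    shuffled, idx = shuffle_options(options, answer_idx)
    return substitute_answer(shuffled, idx)
```

Because the gold label moves with the shuffle and the answer's surface form disappears under substitution, a model that relies on option position or memorised answer text will fail, which is precisely the fragility these benchmarks are designed to expose.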