Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

📅 2023-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant fragility to structural variations in logical reasoning tasks, undermining their reliability and generalizability. Method: We construct three robustness benchmark datasets—ReClor-plus, LogiQA-plus, and LogiQAv2-plus—incorporating systematic perturbations including answer option shuffling, answer substitution, and hybrid disturbances to rigorously assess model stability under structural changes. We further propose (i) logic-driven data augmentation, (ii) task-structure-aware fine-tuning, and (iii) a unified structured prompting framework compatible with both discriminative and generative logical reasoning tasks. Contribution/Results: Our analysis reveals a consistent 20–40% performance drop across mainstream LLMs (e.g., GPT-4, LLaMA) under structural perturbations—a previously undocumented vulnerability. The proposed methods yield an average accuracy improvement of 12.7% on both original and perturbed test sets, substantially enhancing cross-scenario generalization and robustness.
📝 Abstract
Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning has not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of the LLM's reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve the model's performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
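The three perturbed subsets described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' released code: the example schema (`context`, `question`, `options`, `label` fields) is an assumption, and the released datasets on GitHub may differ.

```python
import random

# Substitution string used for the second subset, as quoted in the abstract.
NONE_OPTION = "none of the other options is correct"

def shuffle_options(example, rng):
    """Subset 1: randomly permute the answer options, re-deriving the gold index."""
    options = list(example["options"])
    correct = options[example["label"]]  # remember the correct option's text
    rng.shuffle(options)
    out = dict(example)
    out["options"] = options
    out["label"] = options.index(correct)
    return out

def substitute_answer(example):
    """Subset 2: replace the correct choice with a 'none of the above'-style option."""
    options = list(example["options"])
    options[example["label"]] = NONE_OPTION
    out = dict(example)
    out["options"] = options
    return out  # gold index is unchanged; only its text is replaced

def shuffle_and_substitute(example, rng):
    """Subset 3: apply substitution, then shuffle the resulting options."""
    return shuffle_options(substitute_answer(example), rng)
```

A model that truly reasons over the options should be unaffected by `shuffle_options`, since only the surface order changes; the reported performance drops suggest sensitivity to such surface structure.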
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Complex Logical Reasoning
Stability and Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complex Test Datasets
Logical Reasoning Enhancement
Task Variation and Logic-based Data Augmentation
Qiming Bao
University of Auckland
Artificial Intelligence · Natural Language Processing · LLMs · Reasoning · Neurosymbolic AI
Gaël Gendron
Strong AI Lab, NAO Institute, Waipapa Taumata Rau - The University of Auckland
A. Peng
Strong AI Lab, NAO Institute, Waipapa Taumata Rau - The University of Auckland
Wanjun Zhong
Bytedance Seed Research
NLP
N. Tan
Strong AI Lab, NAO Institute, Waipapa Taumata Rau - The University of Auckland
Yang Chen
Strong AI Lab, NAO Institute, Waipapa Taumata Rau - The University of Auckland
Michael Witbrock
Professor of Computer Science, Waipapa Taumata Rau: The University of Auckland
Artificial Intelligence · Reasoning · Deep Learning · Representation Learning · Natural Language Understanding
Jiamou Liu
The University of Auckland
Social Networks · Artificial Intelligence · Machine Learning