🤖 AI Summary
This work investigates the generalization capability of large language models (LLMs) under structural perturbations in logical reasoning tasks. Method: We propose a controllable stress-testing framework that systematically evaluates four perturbation types: deletion of critical or redundant rules, injection of contradictory evidence, logic-preserving rewrites (e.g., contraposition, De Morgan's laws), and multi-law equivalence transformations (2–5 simultaneous rewrites). Experiments span BERT, Qwen2, and LLaMA series models. Contribution/Results: Models exhibit near-perfect robustness to semantically equivalent rewrites (≈100% accuracy) but suffer sharp performance drops, down to 25%, under critical rule removal, and collapse to 0% under contradiction injection. Remarkably, they remain robust under multi-step logical transformations, indicating strong dependence on evidential completeness rather than formal complexity. This study is the first to disentangle and quantify two primary failure modes in LLM logical reasoning, sensitivity to missing evidence and lack of contradiction tolerance, establishing a novel benchmark for trustworthy reasoning evaluation.
📝 Abstract
Large language models (LLMs) excel across many natural language tasks, yet their generalisation to structural perturbations in logical contexts remains poorly understood. We introduce a controlled evaluation framework that probes reasoning reliability through four targeted stress tests: (1) rule deletion, removing either redundant or essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites generated through several families of equivalence laws (contrapositive, double negation, implication, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that introduces 2–5 simultaneous logical transformations.
Across three representative model families (BERT, Qwen2, and LLaMA-like models), our experiments reveal a strikingly consistent pattern: all models achieve perfect accuracy on the base tasks and generalise fully to redundant rule deletion and to all equivalence-based rewrites (single or multi-law), but fail sharply under essential rule deletion (dropping to 25% accuracy) and collapse completely in the presence of explicit contradictions (0% accuracy). These results demonstrate that LLMs possess stable invariance to semantics-preserving logical transformations, yet remain fundamentally brittle to missing or conflicting evidence. Our framework provides a clean diagnostic tool for isolating such reasoning failure modes and highlights persistent gaps in the logical generalisation abilities of current LLMs.
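Two of the equivalence laws named above, contraposition and De Morgan, can be sketched as mechanical rewrites over symbolic rules. The representation below (nested tuples, and the function names `neg`, `contrapositive`, `de_morgan`) is illustrative and not the paper's actual API; it is a minimal sketch of how such logic-preserving perturbations could be generated.

```python
# Hypothetical sketch of two logic-preserving rewrites used by the framework:
# contraposition (p -> q  ==  ~q -> ~p) and De Morgan (~(p & q) == ~p | ~q).
# Formulas are nested tuples; atoms are plain strings.

def neg(f):
    """Negate a formula, collapsing double negation (~~p == p)."""
    if isinstance(f, tuple) and f[0] == "not":
        return f[1]
    return ("not", f)

def contrapositive(rule):
    """Rewrite (p -> q) as (~q -> ~p); the two are logically equivalent."""
    op, p, q = rule
    assert op == "implies", "contraposition applies only to implications"
    return ("implies", neg(q), neg(p))

def de_morgan(f):
    """Rewrite ~(p & q) as (~p | ~q), and ~(p | q) as (~p & ~q)."""
    if isinstance(f, tuple) and f[0] == "not" and isinstance(f[1], tuple):
        inner_op, p, q = f[1]
        if inner_op == "and":
            return ("or", neg(p), neg(q))
        if inner_op == "or":
            return ("and", neg(p), neg(q))
    return f  # not a negated conjunction/disjunction: leave unchanged

# Example: "if it rains, the ground is wet" and its contrapositive.
rule = ("implies", "rain", "wet")
print(contrapositive(rule))  # ('implies', ('not', 'wet'), ('not', 'rain'))
print(de_morgan(("not", ("and", "rain", "wet"))))
```

Because each rewrite preserves truth conditions, a perturbed task built this way has the same answer as the base task, which is what lets the framework attribute any accuracy change to the surface transformation rather than to the underlying logic.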