DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

📅 2026-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current reasoning benchmarks do not adequately assess whether models revise their beliefs in response to minimal premise perturbations, and so fail to capture dynamic reasoning behavior. This work proposes DeltaLogic, a benchmark transformation protocol that reformulates inference instances into short sequential tasks comprising an initial inference, a minimally edited premise set, and a belief-revision judgment, enabling systematic evaluation of how models update beliefs under localized evidence changes. Using revision episodes built from FOLIO and ProofWriter and a constrained-label scoring mechanism, experiments on small causal language models, including Qwen and Phi models, reveal a pronounced disconnect between strong initial reasoning performance and robust belief updating. Most models exhibit either inertial bias or excessive abstention; Phi-4-mini-instruct is comparatively stable but remains inconsistent, underscoring a gap in dynamic reasoning evaluation.
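The three-step episode described above can be pictured as a small record plus a label-selection rule. The following is only a minimal sketch under stated assumptions: the class `RevisionEpisode`, the function `constrained_label`, the label set `True`/`False`/`Unknown`, and the toy premises are all hypothetical illustrations, not the paper's actual data format or scoring code.

```python
from dataclasses import dataclass

# Assumed label set; the benchmark's exact label strings may differ.
LABELS = ("True", "False", "Unknown")


@dataclass(frozen=True)
class RevisionEpisode:
    """One hypothetical revision episode: premises P, edited premises, and gold labels."""
    premises: tuple[str, ...]          # original premise set P
    edited_premises: tuple[str, ...]   # premise set after the minimal edit
    hypothesis: str                    # conclusion being judged
    initial_gold: str                  # gold label under P
    revised_gold: str                  # gold label under the edited premises

    @property
    def label_should_change(self) -> bool:
        # True when the edit flips the gold label, i.e. the belief must be revised.
        return self.initial_gold != self.revised_gold


def constrained_label(scores: dict[str, float], labels: tuple[str, ...] = LABELS) -> str:
    """Return the best-scoring label among the allowed labels only.

    `scores` stands in for per-label model scores (for example log-likelihoods);
    any key outside `labels` is ignored.
    """
    return max(labels, key=lambda lab: scores.get(lab, float("-inf")))


# Toy example (invented for illustration): deleting one premise removes support.
episode = RevisionEpisode(
    premises=("All birds can fly.", "Tweety is a bird."),
    edited_premises=("All birds can fly.",),
    hypothesis="Tweety can fly.",
    initial_gold="True",
    revised_gold="Unknown",
)
```

Restricting the choice to a fixed label set means free-form generations never need to be parsed; the model's answer is simply whichever permitted label it scores highest.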

📝 Abstract

Reasoning benchmarks typically evaluate whether a model derives the correct answer from a fixed premise set, but they under-measure a closely related capability that matters in dynamic environments: belief revision under minimal evidence change. We introduce DeltaLogic, a benchmark transformation protocol that converts natural-language reasoning examples into short revision episodes. Each episode first asks for an initial conclusion under premises P, then applies a minimal edit δ(P), and finally asks whether the previous conclusion should remain stable or be revised. We instantiate DeltaLogic from FOLIO and ProofWriter and evaluate small causal language models with constrained-label scoring. On a completed 30-episode Qwen evaluation subset, stronger initial reasoning still does not imply stronger revision behavior: Qwen3-1.7B reaches 0.667 initial accuracy but only 0.467 revision accuracy, with inertia rising to 0.600 on episodes where the gold label should change, while Qwen3-0.6B collapses into near-universal abstention. There, Qwen3-4B preserves the same inertial failure pattern (0.650 initial, 0.450 revised, 0.600 inertia), whereas Phi-4-mini-instruct is substantially stronger (0.950 initial, 0.850 revised) but still exhibits non-trivial abstention and control instability. These results suggest that logical competence under fixed premises does not imply disciplined belief revision after local evidence edits. DeltaLogic therefore targets a distinct and practically important reasoning capability that complements existing logical inference and belief-updating benchmarks.
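To make the reported quantities concrete, the sketch below computes initial accuracy, revision accuracy, inertia, and abstention from per-episode outcomes. The abstract does not give formal definitions, so these are assumptions: inertia is taken as the fraction of label-changing episodes on which the model repeats its own initial answer, and abstention as the fraction of revised answers equal to an assumed `Unknown` label. The names `Outcome` and `summarize` and the toy outcomes are hypothetical.

```python
from typing import NamedTuple

ABSTAIN = "Unknown"  # assumed abstention label


class Outcome(NamedTuple):
    """Gold labels and model predictions for one revision episode."""
    initial_gold: str
    initial_pred: str
    revised_gold: str
    revised_pred: str


def _rate(hits: int, total: int) -> float:
    # Guard against empty subsets (for example, no label-changing episodes).
    return hits / total if total else float("nan")


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Aggregate episode outcomes into the four rates, under the assumed definitions."""
    n = len(outcomes)
    # Episodes whose gold label changes after the edit, so revision is required.
    changed = [o for o in outcomes if o.initial_gold != o.revised_gold]
    return {
        "initial_accuracy": _rate(sum(o.initial_pred == o.initial_gold for o in outcomes), n),
        "revision_accuracy": _rate(sum(o.revised_pred == o.revised_gold for o in outcomes), n),
        # Assumed definition: the model keeps its earlier answer although the gold label moved.
        "inertia": _rate(sum(o.revised_pred == o.initial_pred for o in changed), len(changed)),
        "abstention": _rate(sum(o.revised_pred == ABSTAIN for o in outcomes), n),
    }


# Toy outcomes, invented purely to exercise the function.
toy = [
    Outcome("True", "True", "Unknown", "True"),        # should revise, but stays put
    Outcome("True", "True", "True", "True"),           # control episode, stays correct
    Outcome("False", "False", "Unknown", "Unknown"),   # revises correctly
    Outcome("Unknown", "True", "Unknown", "Unknown"),  # wrong at first, right after the edit
]
report = summarize(toy)
```

Computing inertia only over label-changing episodes keeps it distinct from revision accuracy: a model can score well on control episodes, where the label stays fixed, while still failing to update when it should.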

Problem

Research questions and friction points this paper is trying to address.

belief revision

logical reasoning

minimal premise edits

reasoning benchmarks

evidence change

Innovation

Methods, ideas, or system contributions that make the work stand out.

belief revision

minimal premise edits

DeltaLogic