When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL 2 DL

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses LLM overcaution on OWL 2 DL compliance queries: under the open-world assumption, the model frequently answers “unknown” when the reasoner-entailed answer is “no”. The authors compare four interaction modes under a matched query budget: single-shot, generic retry, reasoner-verdict repair with an explicit open-world-assumption (OWA) hint, and the same repair without the hint. Supplying the reasoner’s verdict alone reaches 97.8% faithfulness, well above generic retry (81.7%), the verdict-with-hint variant (67.2%), and the single-shot baseline (43.9%). Notably, adding the explicit OWA hint leaves the model worse off than generic retry, which underscores the need for ablation studies in prompt design. All pairwise differences are significant under McNemar’s exact test with Bonferroni correction. The work concludes that the efficacy of reasoner guidance is highly sensitive to how it is presented.
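The statistical validation named in the summary can be illustrated with a minimal Python sketch of McNemar's exact test followed by a Bonferroni-adjusted threshold. The paper's discordant-pair counts are not given here, so the counts `b = 2` and `c = 40` below are purely hypothetical, and the helper names `mcnemar_exact` and `bonferroni_threshold` are illustrative. The sketch also assumes that 0.01 is the family-wise level and that four interaction modes give six pairwise comparisons; the source does not say whether 0.01 is the level before or after correction.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the discordant pair counts: queries that mode A answers
    correctly and mode B does not, and vice versa. Under the null
    hypothesis each discordant pair is a fair coin flip, so the test is
    a two-sided binomial test with success probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni_threshold(alpha: float, comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error at alpha."""
    return alpha / comparisons


# Hypothetical discordant counts for one pair of modes (NOT from the paper).
b, c = 2, 40
p_value = mcnemar_exact(b, c)

# Assumption: family-wise alpha of 0.01 split over 6 pairwise comparisons.
threshold = bonferroni_threshold(0.01, 6)
print(f"p = {p_value:.3g}, threshold = {threshold:.3g}, "
      f"significant = {p_value < threshold}")
```

Because the test uses only the discordant pairs, queries on which both modes agree (both right or both wrong) do not influence the p-value; this is what makes it suitable for paired comparisons on the same query set.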

📝 Abstract

We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers “unknown” when the reasoner-entailed answer is “no” under FunctionalProperty closure or class disjointness. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under a matched query budget: single-shot, three rounds of generic “you-are-wrong” retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Single-shot (direct) faithfulness is 43.9% (Wilson 95% CI [36.8, 51.2]); generic retry reaches 81.7% ([75.4, 86.6]); the verdict-with-hint variant is worse at 67.2% ([60.1, 73.7]); the verdict-only variant reaches 97.8% ([94.4, 99.1]). All pairwise comparisons remain significant under McNemar's exact test with Bonferroni correction (α = 0.01; all p < 10⁻⁵). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.
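As a check on the reported intervals, the Wilson score interval can be recomputed in a few lines of Python. The success counts below (79, 147, 121, 176 out of 180) are not stated in the abstract; they are reconstructed here as the only integers consistent with the rounded percentages at n = 180, and should be read as an assumption. The function name `wilson_interval` and the dictionary `ASSUMED_COUNTS` are illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion.

    z = 1.959964 is the standard normal quantile for a two-sided
    95% interval.
    """
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts, reconstructed from the rounded percentages (n = 180).
N = 180
ASSUMED_COUNTS = {
    "single-shot": 79,
    "generic retry": 147,
    "verdict + OWA hint": 121,
    "verdict only": 176,
}

for mode, k in ASSUMED_COUNTS.items():
    lo, hi = wilson_interval(k, N)
    print(f"{mode}: {100 * k / N:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Under these assumed counts the recomputed intervals agree with the abstract to one decimal place. The Wilson interval is a reasonable choice for the verdict-only condition in particular, since a proportion near 100% makes the simpler normal-approximation interval unreliable.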

Problem

Research questions and friction points this paper is trying to address.

LLM overcaution

entailed negations

OWL 2 DL

reasoner-guided repair

open-world assumption

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoner-guided repair

overcaution

entailed negation