Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of current scientific reasoning models: they often produce definitive answers despite insufficient evidence, with no mechanism for abstaining. To remedy this, we propose an abstention-aware scientific verification framework that decomposes scientific claims into minimal atomic conditions and evaluates each condition via natural language inference (NLI) under confidence thresholds. Based on evidence sufficiency, the framework dynamically decides whether to support, refute, or abstain. We establish an evaluation paradigm centered on rigorous assessment of evidential adequacy and evaluate multiple large language models on benchmarks such as SciFact and PubMedQA. Experimental results demonstrate that our approach significantly reduces error rates at moderate coverage levels, underscoring the vital role of abstention mechanisms in enhancing the reliability of scientific reasoning systems.
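The per-condition decision rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `nli_probs` is a hypothetical stand-in for any NLI model returning (entail, contradict, neutral) probabilities, and the condition texts, toy scores, and threshold are invented for the example.

```python
from typing import Callable, List, Tuple

def verify_claim(
    conditions: List[str],
    nli_probs: Callable[[str], Tuple[float, float, float]],
    tau: float = 0.7,
) -> str:
    """Aggregate per-condition NLI judgments into a claim-level verdict.

    Each condition is scored as (entail, contradict, neutral). A confident
    contradiction anywhere refutes the whole claim; any condition whose
    entailment falls below the threshold triggers abstention.
    """
    for cond in conditions:
        entail, contradict, _neutral = nli_probs(cond)
        if contradict >= tau:
            return "refute"
        if entail < tau:
            return "abstain"  # evidence insufficient for this condition
    return "support"

# Toy stand-in for an NLI model scoring conditions against evidence
# (illustrative numbers, not from the paper).
scores = {
    "drug X binds receptor Y": (0.92, 0.03, 0.05),
    "binding reduces symptom Z": (0.40, 0.10, 0.50),  # weak evidence
}

print(verify_claim(list(scores), lambda c: scores[c]))  # prints "abstain"
```

Because a single under-evidenced condition forces abstention, the rule is conservative by design: a claim is only supported when every atomic condition clears the confidence threshold.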

📝 Abstract
Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science.
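The claim that "confidence-based abstention substantially reduces risk at moderate coverage levels" refers to the standard risk-coverage tradeoff in selective prediction. A small sketch with invented numbers (the confidences and correctness labels below are illustrative, not results from the paper) shows how raising the abstention threshold trades coverage for lower selective risk:

```python
def risk_coverage(confidences, correct, tau):
    """Coverage and selective risk when answering only above threshold tau.

    coverage: fraction of all claims answered.
    risk: error rate among the answered claims.
    """
    answered = [c for conf, c in zip(confidences, correct) if conf >= tau]
    if not answered:
        return 0.0, 0.0  # nothing answered: zero coverage, zero risk
    coverage = len(answered) / len(correct)
    risk = 1 - sum(answered) / len(answered)
    return coverage, risk

# Illustrative model outputs: confidence per claim and whether it was correct.
confidences = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4]
correct = [1, 1, 1, 0, 1, 0]

for tau in (0.0, 0.5, 0.7):
    cov, risk = risk_coverage(confidences, correct, tau)
    print(f"tau={tau:.1f}  coverage={cov:.2f}  risk={risk:.2f}")
# tau=0.0  coverage=1.00  risk=0.33
# tau=0.5  coverage=0.83  risk=0.20
# tau=0.7  coverage=0.50  risk=0.00
```

At tau=0.7 the model still answers half of the claims while eliminating errors on the answered subset, which is the pattern the abstract describes: risk drops sharply even though overall accuracy on all claims is unchanged.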
Problem

Research questions and friction points this paper is trying to address.

abstention
scientific reasoning
evidence sufficiency
uncertainty
claim verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

abstention-aware reasoning
scientific claim verification
natural language inference
selective answering
evidence-based abstention