A MISMATCHED Benchmark for Scientific Natural Language Inference

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific natural language inference (NLI) datasets are heavily biased toward computer science, leaving disciplines such as psychology, engineering, and public health uncovered and hindering research on interdisciplinary scientific reasoning. To address this gap, the paper introduces MISMATCHED, an interdisciplinary scientific NLI benchmark of 2,700 manually annotated sentence pairs drawn from these underrepresented domains. The authors establish strong baselines with both pre-trained small language models (SLMs) and large language models (LLMs), and show that adding sentence pairs with an implicit scientific NLI relation between them to the training data improves performance on scientific NLI. Even the best baseline reaches only 78.17% Macro F1, underscoring the task's difficulty and the room for future improvement. The dataset and code are publicly available.

📝 Abstract
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains: PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best-performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Existing NLI datasets lack non-CS domain coverage
Need for benchmark in PSYCHOLOGY, ENGINEERING, PUBLIC HEALTH
Even strong baseline models reach only 78.17% Macro F1
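The headline number above is a macro-averaged F1 score: the unweighted mean of per-class F1, so minority relation classes count as much as frequent ones. As a minimal sketch (the label names below are hypothetical, not the benchmark's actual label set), the metric can be computed as:

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        # Per-class confusion counts, treating `label` as the positive class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    # Unweighted mean across classes: rare classes weigh as much as common ones.
    return sum(f1_scores) / len(f1_scores)

# Hypothetical 3-way label set, for illustration only.
labels = ["entailment", "contradiction", "neutral"]
y_true = ["entailment", "neutral", "contradiction", "entailment"]
y_pred = ["entailment", "neutral", "neutral", "contradiction"]
score = macro_f1(y_true, y_pred, labels)
```

Because every class contributes equally to the average, a model that does well on frequent relations but poorly on rare ones is penalized, which is why macro F1 is the standard report for class-imbalanced NLI benchmarks.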
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MISMATCHED benchmark for non-CS domains
Establishes strong baselines using both pre-trained SLMs and LLMs
Incorporates implicit NLI relations in training