🤖 AI Summary
Existing scientific natural language inference (NLI) datasets are heavily biased toward computer science (CS), with little coverage of non-CS disciplines such as psychology, engineering, and public health, which hinders research on interdisciplinary scientific reasoning. To address this gap, we introduce MISMATCHED, an interdisciplinary scientific NLI benchmark of 2,700 human-annotated sentence pairs covering these underrepresented domains. We establish strong baselines using both pre-trained small language models (SLMs) and large language models (LLMs), and show that incorporating sentence pairs with an implicit scientific NLI relation into training improves performance. Even the best baseline reaches only 78.17% Macro F1, underscoring the task's difficulty and the substantial room for improvement. The dataset and code are publicly available, supporting reproducible research on scientific reasoning.
📝 Abstract
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains: PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best-performing baseline achieves a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
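Since results on the benchmark are reported as Macro F1, it may help to see what that metric computes: the F1 score is calculated per relation class and then averaged with equal weight per class, so rare classes count as much as frequent ones. Below is a minimal, dependency-free sketch; the label names are illustrative placeholders, not the actual MISMATCHED label set.

```python
def macro_f1(gold, pred):
    """Compute Macro F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for lab in labels:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    # Equal weight per class, regardless of class frequency.
    return sum(f1_scores) / len(f1_scores)


# Toy example with placeholder relation labels (not the paper's label set).
gold = ["entailment", "contrasting", "entailment", "neutral"]
pred = ["entailment", "entailment", "entailment", "neutral"]
print(macro_f1(gold, pred))  # → 0.6
```

Because every class contributes equally to the average, a model that ignores a minority relation class is penalized more under Macro F1 than under accuracy.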