🤖 AI Summary
This work addresses a gap in existing large language model (LLM) evaluation benchmarks: they offer limited means of assessing entity-based causal commonsense reasoning and do not explicitly evaluate abductive reasoning or explanation generation. To bridge this gap, the authors introduce CommonWhy, a benchmark of 15,000 “why” questions constructed from Wikidata that brings causal commonsense reasoning into knowledge graph question answering (KGQA), moving beyond the conventional fact-retrieval paradigm. CommonWhy requires models to generate explanatory rationales grounded in entity-centric knowledge. Experiments show that state-of-the-art LLMs and KGQA methods fall significantly short on this benchmark, frequently hallucinating facts and failing at causal inference, which underscores both the difficulty and the evaluative value of CommonWhy.
📝 Abstract
To interact effectively with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that combines factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating entity-based commonsense reasoning in LLMs have largely focused on True/False or multiple-choice questions, leaving a model's ability to reason abductively about causes and effects and to generate explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 "why" questions designed to evaluate LLMs' entity-based commonsense reasoning about causal relationships. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its questions is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.