🤖 AI Summary
Existing NLP methods rely heavily on lexical cues to identify event causality and exhibit poor out-of-distribution generalization. To address this, we introduce ACCESS, the first benchmark for causal discovery and reasoning at the *abstract level* of everyday events. Built upon GLUCOSE, ACCESS comprises 1,400 abstract causal pairs explicitly stripped of concrete surface realizations to emphasize generalizable causal modeling. We formally define and implement the first end-to-end pipeline for automatic identification, extraction, and verification of abstract causal events, integrating statistical models with large language models (LLMs) for abstraction assessment and causal inference. Experiments reveal substantial limitations of both current LLMs and statistical approaches in abstract causal discovery. Empirically, ACCESS significantly improves LLM performance on causal question-answering tasks, demonstrating its utility in advancing causal knowledge toward human-level generalization.
📄 Abstract
Identifying cause-and-effect relationships is critical to understanding real-world dynamics and, ultimately, to causal reasoning. Existing methods for identifying event causality in NLP, including those based on Large Language Models (LLMs), exhibit difficulties in out-of-distribution settings due to the limited scale of available benchmarks and their heavy reliance on lexical cues. Modern benchmarks, inspired by probabilistic causal inference, have attempted to construct causal graphs of events as a robust representation of causal knowledge, where `CRAB` (Romanou et al., 2023) is one recent benchmark along this line. In this paper, we introduce `ACCESS`, a benchmark designed for discovery and reasoning over abstract causal events. Unlike existing resources, `ACCESS` focuses on the causality of everyday-life events at the abstraction level. We propose a pipeline for identifying abstractions for event generalizations from `GLUCOSE` (Mostafazadeh et al., 2020), a large-scale dataset of implicit commonsense causal knowledge, from which we subsequently extract 1.4K causal pairs. Our experiments highlight the ongoing challenges of using statistical methods and/or LLMs for automatic abstraction identification and causal discovery in NLP. Nonetheless, we demonstrate that the abstract causal knowledge provided in `ACCESS` can be leveraged to enhance QA reasoning performance in LLMs.