ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

📅 2025-02-21
🤖 AI Summary
This paper addresses the limited capability of large language models (LLMs) in explicit causal reasoning. It introduces ExpliCa, a benchmark for evaluating how well models understand causal and temporal relations that are stated explicitly through linguistic connectives. The benchmark consists of connective-driven causal and temporal sentence pairs, presented in different linguistic orders and annotated with crowdsourced acceptability ratings. The work disentangles causal from temporal reasoning and shows that linguistic order and model scale affect the two differently. The evaluation framework combines prompting with perplexity-based scoring and is validated against the human annotations. Across seven representative LLMs, the maximum accuracy on ExpliCa is only 0.79, which points to a substantial bottleneck in explicit causal reasoning. Key contributions include: (1) a benchmark for explicit causal reasoning; (2) a methodology for disentangling causal and temporal relations; and (3) a reproducible dual-track evaluation paradigm that integrates prompting and perplexity metrics.

📝 Abstract
Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
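
The perplexity-based track mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each candidate connective is inserted between the two events, that the resulting sentence is scored by a language model, and that the connective with the lowest perplexity is taken as the model's choice. The abstract does not spell out the authors' exact scoring procedure, so this is only one plausible reading. The function names, the connectives, and the log-probability values below are hypothetical and chosen for illustration; in practice the per-token log-probabilities would come from the model under evaluation.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


def pick_connective(candidates):
    """Return (best_connective, scores), where best has the lowest perplexity.

    `candidates` maps each connective to the token log-probabilities of the
    full sentence built with that connective (hypothetical input format).
    """
    scores = {conn: perplexity(lps) for conn, lps in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up log-probabilities, for illustration only (not results from the paper).
candidates = {
    "so": [-1.2, -0.8, -0.5, -1.0],
    "then": [-1.2, -0.8, -1.9, -1.4],
}

best, scores = pick_connective(candidates)
print(best, scores)
```

Scoring by perplexity rather than by summed log-probability normalizes for sentence length, which matters when candidate connectives tokenize into different numbers of tokens.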
Problem

Research questions and friction points this paper is trying to address.

Evaluating explicit causal reasoning in LLMs
Challenges in distinguishing causal from temporal relations
Impact of linguistic order on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset with causal-temporal relations
Crowdsourced human acceptability ratings
Prompting and perplexity-based metrics
Authors
Martina Miliani
CoLing Lab, Department of Philology, Literature, and Linguistics, University of Pisa, Italy
Serena Auriemma
CoLing Lab, Department of Philology, Literature, and Linguistics, University of Pisa, Italy
Alessandro Bondielli
CoLing Lab, Department of Philology, Literature, and Linguistics, University of Pisa, Italy; Department of Informatics, University of Pisa, Italy
Emmanuele Chersoni
Hong Kong Polytechnic University
Computational Linguistics
Lucia Passaro
Department of Informatics, University of Pisa, Italy
Irene Sucameli
Alessandro Lenci
CoLing Lab, Department of Philology, Literature, and Linguistics, University of Pisa, Italy