METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of causal reasoning in large language models lack contextual consistency and fail to span all levels of the causal hierarchy. This work proposes METER, a benchmark framework that systematically assesses model performance across the three rungs of the causal ladder (association, intervention, and counterfactuals) within a unified contextual setting. By combining error pattern analysis with internal information flow tracing, the study identifies mechanisms underlying the performance degradation. The findings show that models are distracted by factually correct but causally irrelevant information at lower levels and are markedly less faithful to the provided context in higher-level tasks, so performance declines consistently as causal complexity increases. The research establishes a new benchmark and offers mechanistic insights for understanding and improving causal reasoning in large language models.

📝 Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we introduce METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER .
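
To make the three levels of the causal ladder concrete, the sketch below poses one question per level against a single shared context. It is only an illustration: the rain/sprinkler scenario, the probabilities, and every function name are assumptions made up for this example and are not taken from METER's dataset or code.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy context (NOT from METER): rain and a sprinkler can each wet the ground.
P_RAIN = Fraction(3, 10)       # assumed prior probability of rain
P_SPRINKLER = Fraction(1, 2)   # assumed prior probability the sprinkler is on


def model(u_rain, u_sprinkler, do_sprinkler=None):
    """Evaluate the structural equations for one setting of the exogenous noise."""
    rain = u_rain
    # An intervention overrides the sprinkler's own mechanism.
    sprinkler = u_sprinkler if do_sprinkler is None else do_sprinkler
    wet = rain or sprinkler
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}


def worlds():
    """Yield (probability, u_rain, u_sprinkler) for every exogenous setting."""
    for u_rain, u_sprinkler in product([False, True], repeat=2):
        p_r = P_RAIN if u_rain else 1 - P_RAIN
        p_s = P_SPRINKLER if u_sprinkler else 1 - P_SPRINKLER
        yield p_r * p_s, u_rain, u_sprinkler


def association():
    """Level 1: P(rain | wet), computed from the observational distribution."""
    joint = sum(p for p, ur, us in worlds()
                if model(ur, us)["wet"] and model(ur, us)["rain"])
    evidence = sum(p for p, ur, us in worlds() if model(ur, us)["wet"])
    return joint / evidence


def intervention():
    """Level 2: P(wet | do(sprinkler = off))."""
    return sum(p for p, ur, us in worlds()
               if model(ur, us, do_sprinkler=False)["wet"])


def counterfactual():
    """Level 3: given no rain, sprinkler on, wet ground, the probability
    the ground would still be wet had the sprinkler been off."""
    observed = {"rain": False, "sprinkler": True, "wet": True}
    # Abduction: keep only the noise settings consistent with the observation.
    consistent = [(p, ur, us) for p, ur, us in worlds()
                  if model(ur, us) == observed]
    total = sum(p for p, _, _ in consistent)
    # Action and prediction: rerun those settings with the sprinkler forced off.
    still_wet = sum(p for p, ur, us in consistent
                    if model(ur, us, do_sprinkler=False)["wet"])
    return still_wet / total


if __name__ == "__main__":
    print("association   P(rain | wet)            =", association())
    print("intervention  P(wet | do(sprinkler=0)) =", intervention())
    print("counterfactual P(wet had sprinkler=0)  =", counterfactual())
```

All three questions share the same context and differ only in the kind of reasoning they require, which is the sense of "unified context setting" this sketch tries to convey.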

Problem

Research questions and friction points this paper is trying to address.

contextual causal reasoning

large language models

causal hierarchy

benchmarking

context consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

large language models

benchmarking