LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of reliable, low-cost benchmarks for evaluating the faithfulness of concept-based explanations to large language model (LLM) behavior, noting that manually constructed counterfactuals are expensive and prone to bias. To this end, we propose LIBERTy, the first benchmark framework that leverages structural causal models (SCMs) to automatically generate counterfactual pairs by applying causal interventions on concepts and using LLMs to produce structured counterfactual data. We introduce a novel metric, order-faithfulness, and conduct empirical analyses across three domains and five models, revealing substantial room for improvement in current explanation methods. Our findings also indicate that proprietary LLMs exhibit markedly reduced sensitivity to demographic concepts compared to their open counterparts.
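The summary names a new metric, order-faithfulness, without defining it. One plausible reading (an assumption, not the paper's stated definition) is pairwise rank agreement: the fraction of concept pairs whose ordering by explanation score matches their ordering by reference causal effect, i.e. a rescaled Kendall-tau-style statistic. A minimal sketch under that assumption:

```python
from itertools import combinations

def order_faithfulness(explanation_scores, reference_effects):
    """Hypothetical order-faithfulness: fraction of concept pairs whose
    ordering by explanation score agrees with their ordering by the
    reference causal effect. Ties in either quantity are skipped."""
    agree = total = 0
    for a, b in combinations(explanation_scores, 2):
        e = explanation_scores[a] - explanation_scores[b]
        r = reference_effects[a] - reference_effects[b]
        if e != 0 and r != 0:  # only count strictly ordered pairs
            total += 1
            agree += (e > 0) == (r > 0)
    return agree / total if total else 1.0

# Toy example: the explanation ranks the three concepts in the same
# order as the reference effects, so all pairs agree.
scores = {"gender": 0.1, "experience": 0.7, "education": 0.4}
effects = {"gender": 0.05, "experience": 0.6, "education": 0.3}
score = order_faithfulness(scores, effects)  # 1.0 here
```

A perfectly order-faithful explanation would score 1.0; random concept orderings would score around 0.5.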

📝 Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text-generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual text. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
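The abstract describes the generation mechanism: an explicit SCM over text-generating variables, a do()-intervention on one concept, and propagation to the concept's causal descendants before an LLM renders the final text. The toy SCM, variable names, and CV-screening variables below are illustrative assumptions, not the paper's actual models; the LLM rendering step is replaced by a deterministic formatter:

```python
import random

def sample_exogenous(seed):
    """Abduction step: draw the exogenous noise once, so the factual
    and counterfactual share everything except the intervention."""
    rng = random.Random(seed)
    return {"u_gender": rng.random(), "u_skills": rng.random()}

def scm(u, do_gender=None):
    """Toy CV-screening SCM. `do_gender` performs a do()-intervention:
    the concept is fixed, and only its causal descendants (here, the
    candidate's name) change with it; skills are causally independent."""
    gender = do_gender or ("female" if u["u_gender"] < 0.5 else "male")
    name = {"male": "James", "female": "Maria"}[gender]
    skills = ["Python", "SQL"] if u["u_skills"] < 0.5 else ["Java", "AWS"]
    return {"gender": gender, "name": name, "skills": skills}

def render(v):
    """Stand-in for the LLM that would write the full CV text from the
    structured SCM variables."""
    return f"{v['name']} ({v['gender']}), skills: " + ", ".join(v["skills"])

# Structural counterfactual pair: identical exogenous noise, opposite
# value of the intervened concept.
u = sample_exogenous(42)
factual = scm(u)
counterfactual = scm(u, do_gender="male" if factual["gender"] == "female" else "female")
pair = (render(factual), render(counterfactual))
```

The pair differs only in the intervened concept and its descendants, which is what makes it usable as a reference causal effect for faithfulness evaluation.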
Problem

Research questions and friction points this paper is trying to address.

concept-based explanations
counterfactuals
faithfulness
benchmarking
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

structural counterfactuals
concept-based explanations
structural causal models
faithfulness evaluation
LLM explainability