🤖 AI Summary
Problem: Explaining an observed disparity between two groups in large, high-dimensional data remains challenging because existing explanations are rarely both interpretable and actionable.
Method: This paper proposes ExDis, the first automated framework integrating causal inference into disparity explanation. ExDis jointly performs subgroup discovery and causal effect estimation to identify the subpopulations (data regions) where the disparity is most pronounced or reversed, and to isolate the factors that causally contribute to the disparity within each region. It combines rigorous causal inference, scalable subgroup optimization, and high-dimensional feature selection.
Results: Evaluated on three real-world datasets, ExDis outperforms existing methods in both explanation accuracy and interpretability, and scales to large, high-dimensional data. Its causal explanations are not only human-understandable but also actionable, supporting data-driven decisions grounded in causal rather than correlational attribution.
📝 Abstract
During data analysis, we are often perplexed by certain disparities observed between two groups of interest within a dataset. To better understand an observed disparity, we need explanations that can pinpoint the data regions where the disparity is most pronounced, along with its causes, i.e., factors that alleviate or exacerbate the disparity. This task is complex and tedious, particularly for large and high-dimensional datasets, demanding an automatic system for discovering explanations (data regions and causes) of an observed disparity. It is critical that explanations for disparities are not only interpretable but also actionable, enabling users to make informed, data-driven decisions. This requires explanations to go beyond surface-level correlations and instead capture causal relationships. We introduce ExDis, a framework for discovering causal Explanations for Disparities between two groups of interest. ExDis identifies data regions (subpopulations) where disparities are most pronounced (or reversed), and associates specific factors that causally contribute to the disparity within each identified data region. We formally define the ExDis framework and the associated optimization problem, analyze its complexity, and develop an efficient algorithm to solve the problem. Through extensive experiments over three real-world datasets, we demonstrate that ExDis generates meaningful causal explanations, outperforms prior methods, and scales effectively to handle large, high-dimensional datasets.
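To make the notion of a data region where a disparity is "most pronounced (or reversed)" concrete, the following is a minimal sketch of the descriptive half of the task only. It is not the ExDis algorithm: it assumes the disparity is a difference in mean outcome, considers only single-attribute regions, and performs no causal estimation. The toy rows and all names (`disparity`, `rank_regions`, `group`, `dept`, `outcome`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def disparity(rows, group_key, outcome_key, group_a, group_b):
    """Mean outcome of group_a minus mean outcome of group_b.

    Returns None when either group has no rows in this subset.
    """
    a = [r[outcome_key] for r in rows if r[group_key] == group_a]
    b = [r[outcome_key] for r in rows if r[group_key] == group_b]
    if not a or not b:
        return None
    return mean(a) - mean(b)


def rank_regions(rows, group_key, outcome_key, group_a, group_b, attributes):
    """Score each single-attribute region (attribute == value) by its disparity.

    Regions are sorted by absolute disparity, so strongly reversed regions
    (negative sign) rank alongside strongly pronounced ones.
    """
    regions = []
    for attr in attributes:
        buckets = defaultdict(list)
        for r in rows:
            buckets[r[attr]].append(r)
        for value, subset in buckets.items():
            d = disparity(subset, group_key, outcome_key, group_a, group_b)
            if d is not None:
                regions.append(((attr, value), d))
    return sorted(regions, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: two groups, one candidate attribute.
rows = [
    {"group": "A", "dept": "x", "outcome": 10},
    {"group": "B", "dept": "x", "outcome": 4},
    {"group": "A", "dept": "y", "outcome": 5},
    {"group": "B", "dept": "y", "outcome": 7},
]

overall = disparity(rows, "group", "outcome", "A", "B")
ranked = rank_regions(rows, "group", "outcome", "A", "B", ["dept"])
print(overall)  # 2.0
print(ranked)   # [(('dept', 'x'), 6), (('dept', 'y'), -2)]
```

In this toy data the overall disparity favors group A by 2.0, but it is concentrated in `dept == "x"` and reversed in `dept == "y"`. A ranking like this is purely correlational; attributing the disparity in each region to specific factors, as the abstract describes, would additionally require causal effect estimation.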