🤖 AI Summary
Neural networks’ sub-symbolic semantics make causal attribution and interpretation of their implicit task representations difficult. To address this, we propose a probabilistic attribution framework based on Bayesian ablation, which is, to our knowledge, the first to combine Bayesian inference with information theory for quantifying the causal contribution of individual neural units to task performance. Methodologically, we define and estimate three computable representation metrics: representational distributedness, manifold complexity, and polysemanticity, grounded respectively in mutual information, effective dimensionality, and probabilistic sensitivity, to systematically characterize the mapping between representation structure and task semantics. Experiments on multi-task models show that these metrics correlate strongly with generalization performance while improving both representation interpretability and attribution reliability over existing approaches.
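To make the Bayesian-ablation idea concrete, here is a minimal Monte Carlo sketch, not the paper's implementation: a Bernoulli prior is placed over ablation masks, and each unit's causal contribution is estimated as the difference in expected task performance between samples where the unit is kept and samples where it is ablated. The names `bayesian_ablation` and `evaluate` are illustrative assumptions.

```python
import numpy as np

def bayesian_ablation(evaluate, n_units, n_samples=2000, keep_prob=0.5, seed=0):
    """Monte Carlo estimate of each unit's causal contribution to task
    performance under a Bernoulli(keep_prob) prior over ablation masks.

    evaluate : callable taking a 0/1 mask of shape (n_units,) and returning
               a scalar task performance (higher is better), with masked-out
               units ablated (e.g. zeroed).
    Returns c[i] = E[perf | unit i kept] - E[perf | unit i ablated].
    """
    rng = np.random.default_rng(seed)
    kept = rng.random((n_samples, n_units)) < keep_prob      # sampled ablation masks
    perf = np.array([evaluate(mask.astype(float)) for mask in kept])
    return np.array([perf[kept[:, i]].mean() - perf[~kept[:, i]].mean()
                     for i in range(n_units)])

# Toy check: performance is a weighted sum of surviving units, so each
# unit's estimated contribution should recover its weight.
weights = np.array([2.0, 0.1, 0.5])
print(bayesian_ablation(lambda m: float(m @ weights), n_units=3))
# -> approximately [2.0, 0.1, 0.5]
```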
📝 Abstract
Neural networks are powerful tools for cognitive modeling thanks to their flexibility and emergent properties. However, their sub-symbolic semantics make the learned representations difficult to interpret. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units and uses it to infer their causal contributions to task performance. Drawing on information theory, we propose a suite of tools and metrics that illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.
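The abstract does not spell out how the metrics are computed, but manifold complexity is commonly proxied by the effective dimensionality of the activation covariance, for which the participation ratio of the eigenvalue spectrum is a standard estimator. The sketch below assumes that reading; `effective_dimensionality` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def effective_dimensionality(activations):
    """Participation ratio (sum lambda)^2 / sum(lambda^2) of the activation
    covariance eigenvalues: close to 1 when variance concentrates in a single
    direction, up to n_units when it is spread evenly.

    activations : array of shape (n_samples, n_units).
    """
    eig = np.linalg.eigvalsh(np.cov(activations, rowvar=False))
    eig = np.clip(eig, 0.0, None)  # numerical guard against tiny negative eigenvalues
    return eig.sum() ** 2 / np.square(eig).sum()

rng = np.random.default_rng(0)
flat = rng.normal(size=(5000, 50))                              # isotropic: ED near 50
low_rank = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 50))  # rank-2: ED near 2
print(effective_dimensionality(flat), effective_dimensionality(low_rank))
```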