🤖 AI Summary
This work addresses the fact that existing definitions of interpretability are not operational: they do not specify how interpretability can be tested for or designed into a model, which limits their utility in guiding model design and reasoning. It proposes a formalization of interpretability as a symmetry problem, deriving the properties and categories of interpretable models from four symmetry classes. Building on this foundation, the paper constructs a unified Bayesian-inversion framework that integrates core reasoning tasks (alignment, intervention, and counterfactual inference) into a coherent structure. The result is a symmetry-based, operationally grounded theory of interpretability, offering a rigorous formal foundation for reasoning in AI systems.
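As a minimal sketch of what "reasoning as Bayesian inversion" can mean in the standard probabilistic setting (the notation below is illustrative, not taken from the paper): a generative model $p(x \mid z)$ maps interpretable concepts $z$ to observations $x$, and inference inverts this map via Bayes' rule.

```latex
% Bayes' rule as model inversion: the posterior over concepts z given data x
% inverts the generative direction z -> x. Illustrative sketch only; the
% paper's unified formulation may differ in its details.
\[
  p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\int p(x \mid z')\, p(z')\, \mathrm{d}z'}
\]
```

On this reading, alignment, interventional, and counterfactual queries would plausibly correspond to performing the same inversion under suitably modified priors or likelihoods; the paper's precise construction is developed in the full text.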
📝 Abstract
This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed, as existing definitions of interpretability fail to specify how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.
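For intuition on the first of these symmetries, the standard group-theoretic notion of equivariance is sketched below; this is the generic textbook condition, not necessarily the paper's exact definition of inference equivariance.

```latex
% Generic equivariance: a map f commutes with a group G acting on its input
% space X and output space. Shown only to fix intuition for the kind of
% condition the paper's four symmetry classes refine.
\[
  f(g \cdot x) = g \cdot f(x) \qquad \forall\, g \in G,\; x \in \mathcal{X}
\]
% A probabilistic analogue (our illustrative reading): posterior inference
% commutes with the action, with g_* the pushforward of the action onto
% distributions over the concept space.
\[
  p(z \mid g \cdot x) = g_{*}\, p(z \mid x)
\]
```

A condition of this form is directly testable: one can transform inputs by elements of $G$ and check whether the inferred posteriors transform accordingly, which is what makes symmetry-based definitions operational in the sense the abstract demands.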