AI Summary
This work addresses the lack of a verifiable, comparable, and composable mechanistic framework for evaluating neural model interpretability. It proposes the first formal framework grounded in compositionality and the Minimum Description Length principle, leveraging category theory to decompose models into syntax-semantics mapping pairs. Fidelity of explanations is ensured through commutative diagram consistency constraints. Within this framework, interpretability is formulated as an optimization problem balancing fidelity against complexity, and a compression-based distillation algorithm is introduced to systematically simplify model structure without altering its functional behavior. The approach unifies existing mechanistic explanation methods as special cases and demonstrates that, under a parsimony criterion, syntactic compression yields more concise and cognitively plausible explanations, thereby establishing a theoretical foundation for the quantitative evaluation and automated discovery of interpretability.
Abstract
Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.
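As a rough illustration of the constrained-optimisation reading above, one possible form is sketched below. This is not the paper's own notation: the symbols $M$ (the model), $\mathcal{I}(M)$ (the set of candidate compositional interpretations of $M$), $F$ (a faithfulness measure), $C$ (a complexity or description-length measure), and $\epsilon$ (a tolerance) are all assumptions introduced here, as is the choice to minimise complexity subject to a faithfulness bound rather than the reverse.

```latex
% Hypothetical sketch: every symbol here is an illustrative assumption,
% not notation taken from the paper. Requires amsmath for \operatorname*.
\[
  I^{*} \in \operatorname*{arg\,min}_{I \in \mathcal{I}(M)} C(I)
  \quad \text{subject to} \quad
  F(I, M) \ge 1 - \epsilon
\]
```

Under this reading, a refinement step would replace an interpretation $I$ with one of lower $C$ while keeping $F$ within the tolerance, which matches the abstract's description of restructuring a model into simpler parts without altering its function.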