Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Neural network interpretability often lacks formal foundations and verifiability. Method: Drawing on abstract interpretation, a rigorous framework from the program analysis literature, this paper establishes an axiomatic framework for defining and validating compositional, approximate semantic characterizations of model computations. It combines circuit-level neuron analysis, attention-pattern tracing, and reconstruction of the model's logical trajectory to reverse-engineer its mechanistic behavior, and presents evidence that the resulting interpretation satisfies the stated axioms. Contribution/Results: The approach reconstructs the stepwise 2-SAT solving algorithm implemented by a Transformer, including syntactic parsing of the input formula and enumeration/evaluation of the Boolean variable valuations, yielding a verified white-box account of the model's black-box behavior. This work provides a verifiable, reproducible theoretical foundation and methodology for mechanistic interpretability.

📝 Abstract
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We use these axioms to guide the mechanistic interpretability analysis of a Transformer-based model trained to solve the well-known 2-SAT problem. We are able to reverse engineer the algorithm learned by the model -- the model first parses the input formulas and then evaluates their satisfiability via enumeration of different possible valuations of the Boolean input variables. We also present evidence to support that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
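The algorithm the abstract attributes to the model, evaluating satisfiability by enumerating all valuations of the Boolean input variables, can be sketched in a few lines of Python. This is a minimal reference implementation of brute-force 2-SAT for illustration only; the literal encoding (signed integers) is an assumption, not the paper's input representation.

```python
from itertools import product

def solve_2sat_by_enumeration(clauses, num_vars):
    """Brute-force 2-SAT: try every assignment of the Boolean variables.

    A clause is a pair of literals; a positive int i denotes variable i,
    a negative int -i denotes its negation (variables are 1-indexed).
    NOTE: this encoding is a hypothetical choice for illustration, not
    the tokenized representation used by the paper's Transformer.
    """
    for assignment in product([False, True], repeat=num_vars):
        def lit(l):
            value = assignment[abs(l) - 1]
            return value if l > 0 else not value
        # A 2-CNF formula is satisfied when every clause has a true literal.
        if all(lit(a) or lit(b) for (a, b) in clauses):
            return assignment  # satisfying valuation found
    return None  # no valuation works: unsatisfiable

# (x1 ∨ ¬x2) ∧ (¬x1 ∨ x2) is satisfied by x1 = x2 = False.
print(solve_2sat_by_enumeration([(1, -2), (-1, 2)], 2))
```

Enumeration is exponential in the number of variables; polynomial-time 2-SAT algorithms exist (e.g. via implication graphs), but enumeration matches the strategy the paper reports the trained model actually learned.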
Problem

Research questions and friction points this paper is trying to address.

Formally defining mechanistic interpretation axioms for neural networks
Validating interpretations via compositional semantic approximation
Testing axioms on Transformer models solving 2-SAT problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Axiomatic approach for mechanistic interpretation validation
Compositional semantics for neural network analysis
Transformer-based model for 2-SAT problem solving