🤖 AI Summary
This paper addresses the automatic identification and interpretation of rule-based MLP neurons in OthelloGPT. We propose a neuron-behavior decomposition method based on regression decision trees, modeling neuron activations as piecewise logical rules over board states—thereby mapping black-box activations to human-readable strategic patterns. The approach combines decision-tree fitting, causal intervention, and ablation analysis to rigorously verify the causal role of individual neurons in predicting specific moves. Experiments show that 913 of 2,048 neurons (44.6%) in layer 5 admit high-fidelity rule-based approximations (R² > 0.7), and that targeted ablation of key neurons degrades prediction accuracy on the corresponding moves roughly 5–10× more than on control moves, confirming their functional specificity and interpretability.
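The fitting step described above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's pipeline: the board encoding, the synthetic "neuron" (which fires on a two-square diagonal pattern), and all variable names are assumptions. It fits a shallow `DecisionTreeRegressor` to one neuron's activations, scores the fit with R², and prints the tree's decision paths as human-readable threshold rules.

```python
# Hypothetical sketch of the decision-tree fitting step. Board states are
# encoded as per-square features in {-1, 0, +1} (opponent, empty, own);
# the "neuron" here is synthetic and mimics a rule-based unit.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Toy stand-in for board-state features: 5000 positions x 64 squares.
X = rng.integers(-1, 2, size=(5000, 64)).astype(float)

# Toy neuron: active when square 18 is own and square 27 is opponent's,
# plus a little noise -- the kind of piecewise rule the method targets.
y = ((X[:, 18] == 1) & (X[:, 27] == -1)).astype(float)
y += 0.05 * rng.standard_normal(5000)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)
tree.fit(X, y)
r2 = r2_score(y, tree.predict(X))
print(f"R^2 = {r2:.3f}")  # a high R^2 marks the neuron as "rule-based"

# The decision paths, printed as nested threshold tests over squares.
print(export_text(tree, feature_names=[f"sq{i}" for i in range(64)]))
```

In the paper's setting the R² > 0.7 threshold plays the role of the check on `r2` here, and the extracted high-activation paths are what get rendered as logical rules.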
📝 Abstract
OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts the decision paths along which neurons are highly active and converts them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, neurons that specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ($R^2 > 0.7$ for 913 of 2,048 neurons), while the remainder likely participate in more distributed or non-rule-based computations. We verify the causal relevance of the patterns identified by our decision trees through targeted interventions: for specific squares and game patterns, we ablate the neurons corresponding to those patterns and observe an approximately 5-10 fold stronger degradation in the model's ability to predict legal moves along those patterns than along control patterns. To facilitate future work, we provide a Python tool that maps rule-based game behaviors to their implementing neurons, serving as a resource for researchers to test whether their interpretability methods recover meaningful computational structures.
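The ablation test can also be illustrated with a toy model. The sketch below is an assumption-laden stand-in for the real intervention: a linear "readout" in which one neuron carries the signal for a target move while a control move depends on several other neurons. Zeroing that neuron should hurt the target move's predictions far more than the control's, mirroring (in spirit, not in exact magnitude) the asymmetric degradation reported above.

```python
# Hedged sketch of the targeted-ablation test on a toy readout.
# Everything here (neuron indices, thresholds) is illustrative,
# not the actual OthelloGPT setup.
import numpy as np

rng = np.random.default_rng(1)
acts = rng.standard_normal((2000, 32))  # 2000 positions x 32 neurons

# Ground truth in this toy: the target move is legal iff neuron 7 fires
# strongly; the control move depends on a sum over three other neurons.
target_legal = acts[:, 7] > 1.0
control_legal = acts[:, [2, 11, 19]].sum(axis=1) > 1.0

def predict(a, ablate=None):
    a = a.copy()
    if ablate is not None:
        a[:, ablate] = 0.0  # the intervention: zero out the neuron
    return a[:, 7] > 1.0, a[:, [2, 11, 19]].sum(axis=1) > 1.0

def error(pred, truth):
    return float(np.mean(pred != truth))

t_clean, c_clean = predict(acts)
t_abl, c_abl = predict(acts, ablate=7)

target_drop = error(t_abl, target_legal) - error(t_clean, target_legal)
control_drop = error(c_abl, control_legal) - error(c_clean, control_legal)
print(f"target-move error increase:  {target_drop:.3f}")
print(f"control-move error increase: {control_drop:.3f}")
```

Because the toy neuron is perfectly specialized, the control move is untouched by the ablation; in the real model the comparison is quantitative, with target-pattern degradation roughly 5-10 times the control-pattern degradation.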