🤖 AI Summary
This paper addresses the automatic identification and interpretation of rule-based MLP neurons in OthelloGPT. We propose a neuron-behavior decomposition method based on regression decision trees, modeling neuron activations as piecewise logical rules over board states—thereby mapping black-box activations to human-readable strategic patterns. The approach combines decision-tree fitting, causal intervention, and ablation analysis to rigorously verify the causal role of individual neurons in predicting specific moves. Experiments show that 913 of 2,048 neurons (44.6%) in layer 5 admit high-fidelity rule-based approximations (R² > 0.7), and that targeted ablation of key neurons degrades prediction accuracy on the corresponding moves roughly 5–10× more than on control moves, confirming their functional specificity and interpretability.
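The fitting step described above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's pipeline: the board encoding, the synthetic "neuron" (which fires on a two-square diagonal pattern), and all variable names are assumptions. It fits a shallow `DecisionTreeRegressor` to one neuron's activations, scores the fit with R², and prints the tree's decision paths as human-readable threshold rules.

```python
# Hypothetical sketch of the decision-tree fitting step. Board states are
# encoded as per-square features in {-1, 0, +1} (opponent, empty, own);
# the "neuron" here is synthetic and mimics a rule-based unit.
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Toy stand-in for board-state features: 5000 positions x 64 squares.
X = rng.integers(-1, 2, size=(5000, 64)).astype(float)

# Toy neuron: active when square 18 is own and square 27 is opponent's,
# plus a little noise -- the kind of piecewise rule the method targets.
y = ((X[:, 18] == 1) & (X[:, 27] == -1)).astype(float)
y += 0.05 * rng.standard_normal(5000)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)
tree.fit(X, y)
r2 = r2_score(y, tree.predict(X))
print(f"R^2 = {r2:.3f}")  # a high R^2 marks the neuron as "rule-based"

# The decision paths, printed as nested threshold tests over squares.
print(export_text(tree, feature_names=[f"sq{i}" for i in range(64)]))
```

In the paper's setting the R² > 0.7 threshold plays the role of the check on `r2` here, and the extracted high-activation paths are what get rendered as logical rules.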
📝 Abstract
OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts the decision paths along which neurons are highly active and converts them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, neurons that specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ($R^2 > 0.7$ for 913 of 2,048 neurons), while the remainder likely participate in more distributed or non-rule-based computations. We verify the causal relevance of the patterns identified by our decision trees through targeted interventions: for specific squares and game patterns, we ablate the neurons corresponding to those patterns and observe an approximately 5-10 fold stronger degradation in the model's ability to predict legal moves along those patterns than along control patterns. To facilitate future work, we provide a Python tool that maps rule-based game behaviors to their implementing neurons, serving as a resource for researchers to test whether their interpretability methods recover meaningful computational structures.
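The ablation test can also be illustrated with a toy model. The sketch below is an assumption-laden stand-in for the real intervention: a linear "readout" in which one neuron carries the signal for a target move while a control move depends on several other neurons. Zeroing that neuron should hurt the target move's predictions far more than the control's, mirroring (in spirit, not in exact magnitude) the asymmetric degradation reported above.

```python
# Hedged sketch of the targeted-ablation test on a toy readout.
# Everything here (neuron indices, thresholds) is illustrative,
# not the actual OthelloGPT setup.
import numpy as np

rng = np.random.default_rng(1)
acts = rng.standard_normal((2000, 32))  # 2000 positions x 32 neurons

# Ground truth in this toy: the target move is legal iff neuron 7 fires
# strongly; the control move depends on a sum over three other neurons.
target_legal = acts[:, 7] > 1.0
control_legal = acts[:, [2, 11, 19]].sum(axis=1) > 1.0

def predict(a, ablate=None):
    a = a.copy()
    if ablate is not None:
        a[:, ablate] = 0.0  # the intervention: zero out the neuron
    return a[:, 7] > 1.0, a[:, [2, 11, 19]].sum(axis=1) > 1.0

def error(pred, truth):
    return float(np.mean(pred != truth))

t_clean, c_clean = predict(acts)
t_abl, c_abl = predict(acts, ablate=7)

target_drop = error(t_abl, target_legal) - error(t_clean, target_legal)
control_drop = error(c_abl, control_legal) - error(c_clean, control_legal)
print(f"target-move error increase:  {target_drop:.3f}")
print(f"control-move error increase: {control_drop:.3f}")
```

Because the toy neuron is perfectly specialized, the control move is untouched by the ablation; in the real model the comparison is quantitative, with target-pattern degradation roughly 5-10 times the control-pattern degradation.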