Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Understanding the internal mechanisms of model-free reinforcement learning (RL) policies, particularly convolutional recurrent neural networks (RNNs), remains challenging due to their opaque, end-to-end nature. Method: We analyze a convolutional RNN trained with model-free RL on the Sokoban puzzle task using activation-space analysis, directional channel decoding, forward/backward kernel visualization, and extension of test-time compute. Contribution/Results: We find that the network implicitly implements a bidirectional search structure: a distributed transition model, a hierarchical value function, a box-wise state representation, and directional inter-layer activations that enable deep search and dynamic backtracking. This is the first identification of interpretable components with clear search semantics in a model-free RL-trained RNN. Crucially, activation magnitudes directly govern pruning and backtracking behavior, yielding substantial improvements in solution rates. Our findings demonstrate that black-box RL policies can be systematically mapped onto structured search algorithms, establishing a novel paradigm for RL interpretability.

📝 Abstract
We partially reverse-engineer a convolutional recurrent neural network (RNN) trained to play the puzzle game Sokoban with model-free reinforcement learning. Prior work found that this network solves more levels with more test-time compute. Our analysis reveals several mechanisms analogous to components of classic bidirectional search. For each square, the RNN represents its plan in the activations of channels associated with specific directions. These state-action activations are analogous to a value function: their magnitudes determine when to backtrack and which plan branch survives pruning. Specialized kernels extend these activations (containing plan and value) forward and backward to create paths, forming a transition model. The algorithm is also unlike classical search in some ways. State representation is not unified; instead, the network considers each box separately. Each layer has its own plan representation and value function, increasing search depth. Far from being inscrutable, the mechanisms leveraging test-time compute learned in this network by model-free training can be understood in familiar terms.
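The abstract's core mechanism, per-square directional channels whose magnitudes act like a value function that decides pruning and backtracking, can be illustrated with a toy sketch. This is not the paper's code: the channel layout, array shapes, and threshold below are hypothetical, chosen only to show the decoding logic in familiar terms.

```python
import numpy as np

# Hypothetical layout: 4 directional channels over an H x W grid of squares.
DIRECTIONS = ["up", "down", "left", "right"]

def decode_plan(activations, backtrack_threshold=0.1):
    """Decode a per-square plan from directional channel activations.

    activations: (4, H, W) array; channel i encodes direction DIRECTIONS[i].
    Returns an (H, W) integer array: the index of the strongest direction
    per square, or -1 ("no surviving branch / backtrack") where every
    channel's magnitude falls below the threshold.
    """
    mags = np.abs(activations)
    best = mags.argmax(axis=0)                      # winning direction per square
    keep = mags.max(axis=0) > backtrack_threshold   # prune weak plan branches
    return np.where(keep, best, -1)

# Tiny example grid: a strong "right" plan at (0, 0), a weak "up"
# activation at (1, 1) that gets pruned (triggering backtracking).
acts = np.zeros((4, 2, 2))
acts[3, 0, 0] = 0.9
acts[0, 1, 1] = 0.05
plan = decode_plan(acts)
```

Here `plan[0, 0]` is 3 ("right") and `plan[1, 1]` is -1, mirroring the paper's finding that whichever branch has the larger activation magnitude survives, while squares with only weak activations are abandoned.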
Problem

Research questions and friction points this paper is trying to address.

Reverse-engineer RNN's decision-making in Sokoban
Analyze directional activations as value functions
Understand transition model formation in neural networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

RNN represents plans via directional activations
Specialized kernels extend activations for paths
Each layer has unique plan and value functions
Mohammad Taufeeque
Research Engineer at FAR AI
AI Safety · Machine Learning
Aaron David Tucker
FAR.AI, Berkeley, California, United States of America
A. Gleave
FAR.AI, Berkeley, California, United States of America
Adrià Garriga-Alonso
Research Scientist, FAR AI
AI Safety · Interpretability