🤖 AI Summary
Linear directions can capture individual concepts in language models, but a collection of such directions does not represent the relational structure among them. This work introduces a Tensor Product Representation (TPR) probing framework in the structured setting of the board game Othello, factoring the model’s board-state representation into square (position) embeddings, color embeddings, and a binding matrix that composes them. The authors show that conventional linear probes can be reconstructed directly from the parameters of the TPR probe. The TPR probe’s weights also carry geometric signatures that align with the layout of the board, suggesting that linear directions may be projections of more structured underlying representations. By recovering the model’s structured board-state representation and tying linear probes to TPR probes, the study offers a structured account of how such models represent state internally.
📝 Abstract
While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain: the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board and, perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.
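
The abstract does not spell out the probe's parameterization, so the following is only a minimal sketch of how a factorization into square embeddings, color embeddings, and a binding map could yield linear probes. It assumes a bilinear form in which the logit for (square, color) is obtained by contracting the hidden state with a binding tensor and then with the two embeddings. All names (`R`, `F`, `B`, `tpr_logits`, `recover_linear_probes`) and all dimensions are hypothetical, the weights are random rather than trained, and the binding map is written as a 3-way tensor (equivalently, a matrix over the flattened square-by-color embedding space).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these come from the paper.
d_model, d_square, d_color = 32, 8, 4
n_squares, n_colors = 64, 3  # 8x8 board; three states per square (assumed)

# Random stand-ins for trained probe parameters.
R = rng.normal(size=(n_squares, d_square))          # square (role) embeddings
F = rng.normal(size=(n_colors, d_color))            # color (filler) embeddings
B = rng.normal(size=(d_model, d_square, d_color))   # binding tensor

def tpr_logits(h):
    """Map a hidden state h of shape (d_model,) to (n_squares, n_colors) logits."""
    bound = np.einsum("d,dij->ij", h, B)  # project h into square x color space
    return R @ bound @ F.T                # read out every (square, color) pair

def recover_linear_probes():
    """Collapse the TPR parameters into one direction per (square, color)."""
    return np.einsum("si,dij,cj->scd", R, B, F)  # (n_squares, n_colors, d_model)

h = rng.normal(size=d_model)
W = recover_linear_probes()

# Under this assumed form, the TPR probe and the recovered linear probes agree.
assert np.allclose(tpr_logits(h), W @ h)
```

Under this assumed form, each recovered direction `W[s, c]` is the binding tensor contracted with the tensor product of one square embedding and one color embedding. That is one concrete sense in which a linear probe direction can be read as a projection of a more structured representation.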