🤖 AI Summary
This work investigates the learning mechanisms of single-hidden-layer neural networks trained on binary operations over finite groups, focusing on the symmetric group $S_5$, and translates those mechanisms into verifiable performance guarantees.
Method: Using mechanistic interpretability analysis, we identify an approximate equivariance structure in each input argument, a property not verified in prior work on this task. This verified structure is then distilled into a compact proof of model performance.
Contribution/Results: We derive the first non-vacuous, verifiable lower bound on classification accuracy for these models, guaranteeing ≥95% accuracy for 45% of trained models. On $S_5$ tasks, our method certifies accuracy 3× faster than brute-force verification. The framework unifies previously disparate interpretability explanations and establishes a verifiable foundation for understanding how networks learn group-structured functions.
📝 Abstract
A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models in a step towards unifying the explanations of previous works (Chughtai et al., 2023; Stander et al., 2024). Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, a quantitative evaluation of the extent to which we faithfully and concisely explain model internals. In the main text, we focus on the symmetric group $S_5$. For models trained on this group, our explanation yields a guarantee of model accuracy that runs 3× faster than brute force and gives a ≥95% accuracy bound for 45% of the models we trained. We were unable to obtain non-vacuous accuracy bounds using only explanations from previous works.
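As a concrete sketch of the task setup (not code from the paper), the snippet below builds the $S_5$ group-multiplication dataset that such networks are trained on, and checks the equivariance property of the target function that the abstract says trained models approximate: left-multiplying the first argument by $g$ left-multiplies the product by $g$. The representation of permutations as tuples and the name `compose` are illustrative choices, not the paper's.

```python
import itertools
import random

# Elements of S_5 as tuples: p[i] is the image of i under the permutation.
S5 = list(itertools.permutations(range(5)))
assert len(S5) == 120  # |S_5| = 5! = 120

def compose(p, q):
    # (p * q)[i] = p[q[i]]: apply q first, then p.
    return tuple(p[q[i]] for i in range(5))

# The training task: given a pair (a, b), predict a*b as 120-way classification.
idx = {p: i for i, p in enumerate(S5)}
dataset = [((idx[a], idx[b]), idx[compose(a, b)]) for a in S5 for b in S5]
assert len(dataset) == 120 * 120

# Exact equivariance of the target in the first argument (a consequence of
# associativity): replacing a with g*a replaces the label a*b with g*(a*b).
# Trained models approximate this symmetry in their logits.
random.seed(0)
for _ in range(100):
    a, b, g = (random.choice(S5) for _ in range(3))
    assert compose(compose(g, a), b) == compose(g, compose(a, b))
```

An analogous check holds in the second argument with right multiplication, which is why the paper speaks of approximate equivariance in each input argument separately.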