Extracting Finite State Machines from Transformers

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates the mechanistic interpretability of Transformer models trained on regular languages, abstracting their high-level behaviour by extracting Moore machines. Methodologically, it adapts the L* active learning algorithm to extract state machines from trained Transformers, integrating it with state clustering and attention-head attribution analysis to yield verifiable Moore-machine models. The contributions are threefold: (1) empirically tighter lower bounds on the trainability of Transformers on regular languages, in the setting where a finite number of symbols determines the state; (2) a characterisation of the regular languages that a one-layer Transformer can learn with good length generalisation; and (3) identification of a systematic failure mode in which the state-determining symbols are misrecognised due to saturation of the attention mechanism, which constrains length generalisation. Collectively, these results advance both the theoretical understanding of Transformer limitations in symbolic reasoning and the development of rigorous, interpretable abstractions for neural sequence models.

📝 Abstract
Fueled by the popularity of the transformer architecture in deep learning, several works have investigated what formal languages a transformer can learn. Nonetheless, existing results remain hard to compare and a fine-grained understanding of the trainability of transformers on regular languages is still lacking. We investigate transformers trained on regular languages from a mechanistic interpretability perspective. Using an extension of the $L^*$ algorithm, we extract Moore machines from transformers. We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state. Additionally, our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation. However, we also identify failure cases where the determining symbols get misrecognised due to saturation of the attention mechanism.
Problem

Research questions and friction points this paper is trying to address.

Extracting finite state machines from trained transformers
Abstracting transformers with Moore machine representations
Analyzing transformer capabilities on regular language learning
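The Moore-machine abstraction mentioned above can be made concrete with a small sketch. A Moore machine is a finite automaton whose output depends only on the current state, which makes it a natural target for summarising a sequence classifier. The class and the parity example below are illustrative only and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MooreMachine:
    """A Moore machine: the output is attached to states, not transitions."""
    states: frozenset
    alphabet: frozenset
    delta: dict    # (state, symbol) -> state
    output: dict   # state -> output label
    start: object

    def run(self, word):
        """Return the output after consuming `word` from the start state."""
        state = self.start
        for sym in word:
            state = self.delta[(state, sym)]
        return self.output[state]

# Example: acceptance of binary strings with an even number of 1s,
# a simple regular language a two-state Moore machine captures.
parity = MooreMachine(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({"0", "1"}),
    delta={("even", "0"): "even", ("even", "1"): "odd",
           ("odd", "0"): "odd", ("odd", "1"): "even"},
    output={"even": True, "odd": False},
    start="even",
)
```

Because the machine's output is state-local, comparing it against a trained model reduces to running both on the same words and checking for disagreement.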
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracting Moore machines from trained transformers
Using queries and counterexamples for abstraction
Mapping training tasks to finite state automata
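The query-and-counterexample loop can be sketched as follows. In an L*-style setup, the trained transformer plays the role of a black-box membership oracle, and equivalence is approximated by searching for short words where a hypothesis machine disagrees with the model. This is a minimal sketch under assumed names (`classify`, `hypothesis_run` are stand-ins, not the paper's implementation), and the brute-force equivalence check is a simplification of what a real L* learner would use.

```python
from itertools import product

def make_membership_oracle(classify):
    """Wrap a black-box sequence classifier (a stand-in for the trained
    transformer) as a membership oracle with query caching."""
    cache = {}
    def member(word):
        if word not in cache:
            cache[word] = classify(word)
        return cache[word]
    return member

def find_counterexample(member, hypothesis_run, alphabet, max_len=6):
    """Brute-force approximate equivalence query: return the first short
    word on which the hypothesis machine and the oracle disagree,
    or None if they agree on all words up to `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if hypothesis_run(w) != member(w):
                return w
    return None

# Usage: the "model" here is a toy parity classifier; a hypothesis that
# accepts everything is refuted by the shortest odd-parity word.
member = make_membership_oracle(lambda w: w.count("1") % 2 == 0)
cex = find_counterexample(member, lambda w: True, "01")
```

In the full L* loop, each counterexample found this way would be fed back to refine the observation table until no disagreement remains.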
Rik Adriaensen
KU Leuven, Department of Computer Science, Leuven, Belgium
Jaron Maene
KU Leuven
neurosymbolic AI · probabilistic programming