🤖 AI Summary
This study investigates whether the comparable performance of Transformers and Conformers in automatic speech recognition stems from similar internal processing strategies. Using a controlled suite of 24 pretrained encoders spanning 39M to 3.3B parameters, the authors introduce an "architectural fingerprinting" framework that combines representational probing with fine-grained layer-wise analysis. Their findings reveal a fundamental divergence: Conformers resolve phoneme categories roughly 29% earlier in depth, exhibiting a "categorize early" strategy, whereas Transformers defer phoneme, accent, and duration encoding to deeper layers (49-57% of depth), reflecting an "integrate late" behavior. This distinction offers practical guidance for architecture selection in applications demanding low latency versus those requiring rich contextual modeling.
📝 Abstract
In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender 16% earlier. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
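The core measurement behind such layer-wise analyses can be illustrated with a minimal sketch: fit a simple probe on each layer's representations and report the relative depth at which a property (e.g. phoneme category) first becomes decodable. This is an illustrative reconstruction using synthetic data and a nearest-centroid probe, not the authors' code; the function names and the 0.9 accuracy threshold are assumptions for the example.

```python
# Illustrative sketch of layer-wise representational probing (hypothetical
# approximation of "architectural fingerprinting"; not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(feats, labels):
    """Fit a nearest-class-centroid probe and return its training accuracy."""
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    dists = np.stack([np.linalg.norm(feats - m, axis=1) for m in centroids])
    preds = classes[dists.argmin(axis=0)]
    return float((preds == labels).mean())

def resolution_depth(layer_feats, labels, threshold=0.9):
    """Per-layer probe accuracies, plus the relative depth (0..1) at which
    accuracy first reaches `threshold` -- where the property is 'resolved'."""
    accs = [probe_accuracy(f, labels) for f in layer_feats]
    for i, a in enumerate(accs):
        if a >= threshold:
            return accs, i / (len(layer_feats) - 1)
    return accs, 1.0

# Toy stand-in for a 12-layer encoder: class separation grows with depth,
# mimicking a property that becomes decodable partway through the network.
labels = rng.integers(0, 2, size=200)
layer_feats = [rng.normal(0.0, 1.0, (200, 16)) + (d / 11) * 3.0 * labels[:, None]
               for d in range(12)]

accs, depth = resolution_depth(layer_feats, labels)
```

Comparing `depth` across architectures for the same property is, in essence, how a claim like "Conformers resolve phoneme categories 29% earlier in depth" would be quantified: an earlier crossing of the accuracy threshold corresponds to front-loaded categorization.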