🤖 AI Summary
Cross-attention is widely used as an interpretability proxy in speech-to-text (S2T) models, yet how faithfully it reflects true input-output dependencies remains underexplored. Method: We conduct a systematic, cross-model evaluation across diverse S2T architectures--monolingual and multilingual, single-task and multi-task, and at multiple scales--by quantitatively comparing multi-head, multi-layer cross-attention scores against input saliency maps derived from feature attribution. Contribution/Results: Attention scores align moderately to strongly with saliency-based explanations, particularly when aggregated across heads and layers; however, cross-attention captures only about 50% of the input relevance and, even in the best case, accounts for just 52-75% of the saliency. These findings expose intrinsic limitations of interpreting cross-attention as an explanatory proxy, while indicating that aggregation across heads and layers yields more consistent explanations than any single head or layer--a practical guideline for interpretability in S2T models.
📝 Abstract
Cross-attention is a core mechanism in encoder-decoder architectures, widely used across many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications--such as timestamp estimation and audio-text alignment--under the assumption that they reflect the dependencies between the input speech representations and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored within the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, in the best case, only partially reflects how the decoder attends to the encoder's representations--accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
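The comparison described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes a cross-attention tensor of shape `(layers, heads, tgt_len, src_len)` and a saliency map of shape `(tgt_len, src_len)` (both hypothetical inputs), aggregates attention by averaging over heads and layers, and reports the mean per-target-token Pearson correlation between attention and saliency rows.

```python
import numpy as np

def aggregate_cross_attention(attn: np.ndarray) -> np.ndarray:
    """Average cross-attention over layers and heads.

    attn: array of shape (layers, heads, tgt_len, src_len)
    returns: array of shape (tgt_len, src_len)
    """
    return attn.mean(axis=(0, 1))

def saliency_correlation(attn_agg: np.ndarray, saliency: np.ndarray) -> float:
    """Mean per-target-token Pearson correlation between the
    aggregated attention rows and the saliency rows."""
    corrs = []
    for a_row, s_row in zip(attn_agg, saliency):
        a = a_row - a_row.mean()
        s = s_row - s_row.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(s)
        if denom > 0:  # skip constant rows, which have undefined correlation
            corrs.append(float(a @ s / denom))
    return float(np.mean(corrs))

# Illustrative usage with random tensors standing in for model outputs:
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 5, 10))       # 4 layers, 8 heads, 5 tokens, 10 frames
agg = aggregate_cross_attention(attn)  # shape (5, 10)
score = saliency_correlation(agg, rng.random((5, 10)))
```

Aggregating before correlating mirrors the paper's finding that head- and layer-averaged attention aligns with saliency better than any individual head or layer does.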