🤖 AI Summary
This work addresses the high computational and memory costs imposed by the quadratic complexity of self-attention in speech encoders, which hinder model scalability. The authors propose the Polynomial Mixer (PoM), a token-mixing mechanism with linear complexity designed as a plug-and-play replacement for multi-head self-attention. PoM is integrated into the BEST-RQ framework for self-supervised speech representation learning. Experimental results show that PoM substantially reduces both computational and memory requirements while achieving word error rates on downstream automatic speech recognition tasks comparable to those of full self-attention and other efficient attention variants, thereby offering a more favorable trade-off between efficiency and performance.
📝 Abstract
State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with complexity linear in the sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a word error rate competitive with full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in both time and memory.
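The abstract's key claim is that token mixing via a polynomial representation can cost O(n) in sequence length, versus O(n²) for self-attention. The sketch below is a hypothetical NumPy illustration of that general idea only: per-token polynomial features are pooled into one shared state that every token then reads from, so the cost grows linearly with n. The function name `poly_mixer`, the choice of monomial features, and the output projection are all invented for illustration; the abstract does not specify the actual PoM formulation.

```python
import numpy as np

def poly_mixer(x, degree=2, seed=0):
    """Hypothetical linear-complexity token mixer (illustration only;
    NOT the paper's exact PoM formulation).

    Per-token monomial features are pooled into a single shared state,
    so the cost of mixing is O(n) in the sequence length n rather than
    the O(n^2) of pairwise self-attention.
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    # Monomial features of each token: [x, x**2, ..., x**degree] -> (n, d*degree).
    feats = np.concatenate([x ** k for k in range(1, degree + 1)], axis=-1)
    # Shared state: one pooled vector, independent of which token reads it.
    state = feats.mean(axis=0)
    # Random output projection standing in for a learned weight matrix.
    w_out = rng.standard_normal((d * degree, d)) / np.sqrt(d * degree)
    # Each token combines its own features with the global state, then projects back.
    return (feats + state) @ w_out  # (n, d)

if __name__ == "__main__":
    x = np.random.default_rng(1).standard_normal((6, 4))
    y = poly_mixer(x)
    print(y.shape)  # (6, 4): same shape as the input, like an attention block
```

Because the pooled state has a fixed size regardless of n, doubling the sequence length roughly doubles the work, which is the efficiency property the abstract attributes to PoM.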