AI Summary
Standard FFT-based convolution struggles to simultaneously respect sequence boundaries and achieve hardware efficiency when processing packed sequence data, preventing its theoretical advantages from translating into practical training gains. To address this challenge, this work proposes RubiConv, a boundary-aware, hardware-efficient convolution algorithm that, for the first time, enables efficient FFT-based convolution on packed sequences while preserving sequence boundaries. By bridging the gap between theoretical computational complexity and real-world training performance, RubiConv consistently outperforms both conventional FFT convolution and attention mechanisms across multiple large-scale experiments, achieving substantially faster training speeds and higher model efficiency.
Abstract
Convolutional architectures have emerged as powerful alternatives to Transformers for sequence modeling. Their primary advantage is improved theoretical complexity in sequence length, obtained by leveraging the Fast Fourier Transform (FFT). However, this theoretical improvement does not always translate into practical gains. One critical obstacle is that standard FFT algorithms are not easily amenable to document packing, the practice in large-scale training pipelines of packing data from different sources into a single sequence for hardware efficiency. Existing workarounds suffer from severe inefficiencies, crippling the practical performance of convolutional architectures. We close this gap with RubiConv, a novel algorithm for performing hardware-efficient, boundary-respecting convolutions on packed sequences. Extensive experiments show that RubiConv achieves significant speedups over both attention and standard FFT-based baselines. This work makes the theoretical efficiency of long convolutional models a practical reality for large-scale training on real-world packed data.
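The boundary problem described above can be illustrated with a small NumPy sketch. This is not RubiConv, whose internals are not described here; it only shows, under assumed toy inputs, why a single FFT convolution over a packed sequence leaks information across document boundaries, and what a naive per-document workaround looks like. The function names `fft_causal_conv` and `per_document_conv`, the document lengths, and the filter size are all hypothetical choices made for illustration.

```python
import numpy as np

def fft_causal_conv(x, h):
    """Causal linear convolution of x with filter h via FFT, truncated to len(x)."""
    n = len(x) + len(h) - 1          # zero-pad so circular conv equals linear conv
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[: len(x)]

def per_document_conv(x, h, doc_lengths):
    """Naive boundary-respecting workaround: one FFT convolution per document."""
    out = np.empty(len(x), dtype=float)
    start = 0
    for length in doc_lengths:
        end = start + length
        out[start:end] = fft_causal_conv(x[start:end], h)
        start = end
    return out

# Toy packed sequence: two documents of lengths 5 and 3 (assumed values).
rng = np.random.default_rng(0)
doc_lengths = [5, 3]
x = rng.standard_normal(sum(doc_lengths))
h = rng.standard_normal(4)           # assumed filter length

packed = fft_causal_conv(x, h)                    # ignores the document boundary
separate = per_document_conv(x, h, doc_lengths)   # respects the document boundary

# The first document matches; the second is contaminated by the first's tokens.
first_doc_matches = np.allclose(packed[:5], separate[:5])
second_doc_matches = np.allclose(packed[5:], separate[5:])
print(first_doc_matches, second_doc_matches)
```

The per-document loop respects boundaries but launches many small, variably sized FFTs, which is one plausible reading of the inefficiency the abstract attributes to existing workarounds.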