🤖 AI Summary
Transformer-based embedding models incur high computational and memory costs when processing long texts. This work proposes a vertically chunked inference method for recurrent language models such as Mamba2, RWKV, and xLSTM. It runs in time linear in the input length, and its memory usage becomes constant once the input exceeds the chunk size. With fine-tuning, Mamba2 models achieve performance competitive with transformers across multiple embedding benchmarks while using substantially less memory. These results indicate that recurrent architectures are an effective and competitive option for efficient text embedding generation.
📝 Abstract
Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked inference strategy that enables fast embedding generation with memory usage that becomes constant in the input length once the input exceeds the vertical chunk size. By fine-tuning Mamba2 models, we demonstrate their viability as general-purpose text embedders, achieving competitive performance across a range of benchmarks while maintaining a substantially smaller memory footprint than transformer-based counterparts. We empirically validate the applicability of our inference strategy to Mamba2, RWKV, and xLSTM models, confirming consistent runtime-memory trade-offs across architectures and establishing recurrent models as a compelling alternative to transformers for efficient embedding generation.
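The abstract does not spell out the mechanics of vertically chunked inference, so the following is only a minimal sketch of one plausible reading: the input is split into fixed-size chunks, each chunk is pushed through every layer before the next chunk is read, and only the per-layer recurrent states and a running pooled sum are carried forward. The toy layer (an elementwise decaying recurrence), the mean pooling, and the names `make_layers`, `run_layer`, and `embed_chunked` are all assumptions made for illustration; they are not the paper's models or API.

```python
import random


def make_layers(num_layers, dim, seed=0):
    """Build toy recurrent layers (hypothetical stand-ins for Mamba2/RWKV/xLSTM blocks)."""
    rng = random.Random(seed)
    layers = []
    for _ in range(num_layers):
        decay = [rng.uniform(0.5, 0.9) for _ in range(dim)]
        gain = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        layers.append((decay, gain))
    return layers


def run_layer(layer, state, chunk):
    """Apply one toy layer to a chunk: h_t = decay * h_{t-1} + gain * x_t.

    Returns the per-token outputs and the state after the last token.
    """
    decay, gain = layer
    outputs = []
    for x in chunk:
        state = [d * h + g * xi for d, h, g, xi in zip(decay, state, gain, x)]
        outputs.append(list(state))
    return outputs, state


def embed_chunked(layers, tokens, chunk_size):
    """Mean-pooled embedding computed chunk by chunk.

    Live activations are bounded by chunk_size * dim, plus one state per layer
    and one pooled accumulator, regardless of how long `tokens` is.
    """
    dim = len(tokens[0])
    states = [[0.0] * dim for _ in layers]  # one carried state per layer
    pooled = [0.0] * dim                    # running sum for mean pooling
    for start in range(0, len(tokens), chunk_size):
        acts = tokens[start:start + chunk_size]
        # Push this chunk through the whole layer stack before reading the next one.
        for i, layer in enumerate(layers):
            acts, states[i] = run_layer(layer, states[i], acts)
        for vec in acts:
            pooled = [p + v for p, v in zip(pooled, vec)]
    return [p / len(tokens) for p in pooled]


if __name__ == "__main__":
    layers = make_layers(num_layers=3, dim=4, seed=1)
    rng = random.Random(2)
    tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(37)]
    full = embed_chunked(layers, tokens, chunk_size=len(tokens))  # whole sequence at once
    chunked = embed_chunked(layers, tokens, chunk_size=5)
    print(max(abs(a - b) for a, b in zip(full, chunked)))
```

In this toy setting, setting `chunk_size` to the full sequence length reproduces ordinary layer-by-layer inference, and any smaller chunk size yields the same embedding because the recurrence only depends on the carried state. That property is what would let peak activation memory stop growing once the input is longer than one chunk.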