🤖 AI Summary
This work addresses the challenge of constructing efficient bidirectional encoders for resource-constrained industrial settings by reconfiguring the attention-free Avey model into a pure encoder architecture. It introduces three key innovations: decoupled static and dynamic parameterizations, a stability-oriented normalization strategy, and neural compression. The proposed approach achieves, for the first time, high-quality bidirectional contextual modeling without attention mechanisms, consistently outperforming four mainstream Transformer-based encoders on standard token-classification and information-retrieval benchmarks. Furthermore, it demonstrates superior scaling and computational efficiency on long-context tasks, offering a compelling alternative to conventional attention-based architectures in scenarios where computational resources are limited.
📝 Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.