🤖 AI Summary
This work proposes Gecko, a novel architecture designed to address the quadratic computational complexity and poor length extrapolation of standard Transformers in long-sequence modeling. Building upon Mega and Megalodon, Gecko integrates timestep decay normalization, sliding chunk attention, and adaptive working memory, enabling native handling of ultra-long sequences without requiring context-extension techniques. The model stably supports sequences of up to 4 million tokens and effectively retrieves information from contexts four times longer than its attention window. Trained at 7 billion parameters on 2 trillion tokens, Gecko achieves a training loss of 1.68, outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) and approaching Llama2-13B (1.67).
📝 Abstract
Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. Design choices in the Transformer, including quadratic complexity and weak length extrapolation, limit its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and introduces several technical components to improve its ability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
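To make the complexity argument concrete: a sliding chunk attention scheme restricts each chunk of queries to attend only within a fixed local context, so cost grows linearly in sequence length rather than quadratically. The abstract does not specify Gecko's exact masking, so the sketch below is a minimal illustrative version (each chunk attends causally to itself and the previous chunk), not the paper's implementation.

```python
import numpy as np

def sliding_chunk_attention(q, k, v, chunk_size):
    """Causal attention where each chunk attends only to itself and the
    previous chunk, giving O(seq_len * chunk_size) cost instead of
    O(seq_len^2). Illustrative sketch; Gecko's exact scheme may differ."""
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        ctx_start = max(0, start - chunk_size)  # include previous chunk
        scores = q[start:end] @ k[ctx_start:end].T / np.sqrt(d)
        # causal mask: position i may only attend to positions <= i
        rows = np.arange(start, end)[:, None]
        cols = np.arange(ctx_start, end)[None, :]
        scores = np.where(cols <= rows, scores, -np.inf)
        # numerically stable softmax over the local window
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[ctx_start:end]
    return out
```

Because the window is fixed, doubling the sequence length roughly doubles (rather than quadruples) the attention cost; mechanisms like the exponential moving average and adaptive working memory are what carry information beyond this local window.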