Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Gecko, a novel architecture designed to address the quadratic computational complexity and poor length extrapolation of standard Transformers in long-sequence modeling. Building on Mega and Megalodon, Gecko integrates timestep decay normalization, sliding chunk attention, and adaptive working memory, enabling native handling of ultra-long sequences without context-extension techniques. The model stably supports sequences of up to 4 million tokens and retrieves information from contexts four times longer than its attention window. Trained at 7B parameters on 2 trillion tokens, Gecko achieves a training loss of 1.68, outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) and approaching Llama2-13B (1.67).
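The "exponential moving average with gated attention" design Gecko inherits from Mega and Megalodon uses a learnable damped EMA over timesteps. The sketch below follows the Mega-style recurrence $h_t = \alpha \odot x_t + (1 - \alpha \odot \delta) \odot h_{t-1}$; the function name `damped_ema` and the scalar parameterization are illustrative assumptions, not Gecko's actual implementation:

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Mega-style damped EMA over a sequence (a sketch).

    x:     array of shape (seq_len, dim)
    alpha: decay weight in (0, 1)
    delta: damping factor in (0, 1]
    Recurrence: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}
    """
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        # New input is mixed in with weight alpha; the running state
        # decays by (1 - alpha * delta), giving timestep-dependent decay.
        h = alpha * x_t + (1 - alpha * delta) * h
        out.append(h)
    return np.stack(out)
```

In Mega-family models this EMA output feeds a gated attention layer, smoothing local context before attention is applied.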

📝 Abstract
Designing a unified neural network to efficiently and inherently process sequential data of arbitrary length is a central and challenging problem in sequence modeling. The design choices in Transformers, including quadratic complexity and weak length extrapolation, limit their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and introduces several technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
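The sliding chunk attention mentioned in the abstract restricts each position to a local window, reducing attention cost from quadratic to linear in sequence length. The sketch below is a minimal chunked variant where each chunk attends causally to itself and the previous chunk; the chunk size and the exact visibility pattern are assumptions for illustration, not Gecko's published configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliding_chunk_attention(q, k, v, chunk=4):
    """Causal attention limited to the current and previous chunk.

    q, k, v: arrays of shape (seq_len, dim)
    Cost is O(seq_len * chunk) rather than O(seq_len^2).
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        ctx_start = max(0, start - chunk)  # include the previous chunk
        scores = q[start:end] @ k[ctx_start:end].T / np.sqrt(d)
        # Causal mask: a query may only attend to keys at or before it.
        qi = np.arange(start, end)[:, None]
        ki = np.arange(ctx_start, end)[None, :]
        scores = np.where(ki <= qi, scores, -np.inf)
        out[start:end] = softmax(scores) @ v[ctx_start:end]
    return out
```

For sequences no longer than one chunk this reduces to full causal attention; the paper's claim of retrieval beyond the attention window relies on the complementary EMA and working-memory components, not on this local pattern alone.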
Problem

Research questions and friction points this paper is trying to address.

sequence modeling
arbitrary length sequences
long-context processing
neural architecture
length extrapolation
Innovation

Methods, ideas, or system contributions that make the work stand out.

sliding chunk attention
timestep decay normalization
adaptive working memory
long-context modeling
sequence modeling