Olmo Hybrid: From Theory to Practice and Back

📅 2026-04-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates whether hybrid architectures can surpass pure Transformer models in expressive power and performance while maintaining scalability. We propose a hybrid language model that integrates a linear recurrent network (Gated DeltaNet) with attention mechanisms and demonstrate, for the first time at the 7B parameter scale, its empirical superiority over the comparable pure Transformer model, OLMo-3. Through theoretical analysis and large-scale pretraining experiments, we show that this hybrid architecture not only exhibits enhanced sequence modeling capacity but also achieves higher scaling efficiency. The model consistently outperforms the baseline across standard pretraining metrics and downstream task evaluations.
πŸ“ Abstract
Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it's unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
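The Gated DeltaNet layers mentioned in the abstract replace attention's growing key-value cache with a fixed-size recurrent state. Below is a minimal sketch of one recurrent step, assuming the standard gated delta-rule formulation S_t = α_t · S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with scalar gates; the shapes and simplifications are illustrative, not taken from the paper.

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One step of a simplified Gated DeltaNet recurrence.

    S:     (d_v, d_k) fixed-size state matrix (replaces a KV cache)
    q, k:  (d_k,) query / key vectors (k assumed L2-normalized)
    v:     (d_v,) value vector
    alpha: scalar decay gate in (0, 1) applied to the old state
    beta:  scalar write-strength gate in (0, 1) for the delta update
    """
    # Delta rule: erase the value currently bound to key k, decay the
    # state, then write the new value v at key k.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q  # read out with the query
    return S, o
```

With alpha = beta = 1 and an empty state, a single step stores v at key k, so querying with q = k reads v back exactly; the gates let the model interpolate between retaining and overwriting memory.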
Problem

Research questions and friction points this paper is trying to address.

hybrid models
language modeling
scaling efficiency
expressivity
transformer alternatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid language models
linear RNNs
Gated DeltaNet
scaling efficiency
model expressivity
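The hybrid design above swaps Olmo 3's sliding-window attention layers for Gated DeltaNet layers while keeping periodic full-attention layers. A hypothetical sketch of such a layer schedule (the 4-layer period and the names are illustrative assumptions, not taken from the paper):

```python
def hybrid_layer_types(n_layers: int, full_attn_every: int = 4) -> list[str]:
    """Return an illustrative layer schedule for a hybrid stack:
    mostly linear-recurrent layers, with full attention at a fixed period."""
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]

# e.g. hybrid_layer_types(8) interleaves three recurrent layers
# between each full-attention layer.
```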