Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

📅 2025-02-25

🤖 AI Summary

To address the computational inefficiency of Transformers in large-scale sequence modeling, this paper proposes StripedHyena 2, a convolutional multi-hybrid architecture that combines input-dependent convolutions with attention and co-designs the convolution operators with hardware-aware algorithms. Methodologically, it introduces overlap-add blocked kernels for tensor cores and dedicated context parallelism strategies (all-to-all and point-to-point). Reported results include: (1) end-to-end training at the 40B-parameter scale that is 1.2–2.9× faster than optimized Transformers and 1.1–1.4× faster than previous-generation hybrids; (2) individual operators that reach twice the throughput of linear attention and state-space models on H100 GPUs at model width 4096; and (3) strong sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models.

📝 Abstract

We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.

Problem

Research questions and friction points this paper is trying to address.

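How to design hybrid architectures whose operators are tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression.

How to model sequences of byte-tokenized data efficiently with hybrid models.

How to train faster than optimized Transformers and previous-generation hybrids in regimes where earlier alternative architectures struggled to do so.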

Innovation

Methods, ideas, or system contributions that make the work stand out.

Convolutional multi-hybrid architectures for token manipulation tasks

Hardware-aware algorithms for efficiency gains over Transformers

Overlap-add blocked kernels and parallelism strategies for throughput
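
As background for the overlap-add idea named above, here is a minimal sketch of the textbook overlap-add method for a causal 1-D convolution, written with NumPy. It is not the paper's tensor-core kernel: the function name `overlap_add_causal_conv`, the `block_len` parameter, and the single-channel setting are assumptions made for illustration. The point it shows is that a long convolution can be split into independent per-block convolutions whose tails are added into the following block's output.

```python
import numpy as np

def overlap_add_causal_conv(x, h, block_len):
    """Causal FIR convolution y[t] = sum_k h[k] * x[t - k], computed block by block.

    Illustrative only: a generic overlap-add scheme, not the blocked
    tensor-core kernel described in the paper.
    """
    seq_len, filt_len = len(x), len(h)
    # The full linear convolution has seq_len + filt_len - 1 samples.
    y = np.zeros(seq_len + filt_len - 1, dtype=np.result_type(x, h))
    for start in range(0, seq_len, block_len):
        block = x[start:start + block_len]
        # Each block produces len(block) + filt_len - 1 outputs; the last
        # filt_len - 1 of them overlap the next block's output range.
        partial = np.convolve(block, h)
        # Adding (not overwriting) accumulates the overlapping tails.
        y[start:start + len(partial)] += partial
    # Keep only the causal part aligned with the input positions.
    return y[:seq_len]
```

A quick way to check the sketch is to compare it against a direct convolution, for example `np.allclose(overlap_add_causal_conv(x, h, 4), np.convolve(x, h)[:len(x)])` for any 1-D arrays `x` and `h`. Because each block's convolution is independent of the others, the per-block work can in principle be batched, which is what makes blocked formulations a natural fit for hardware matrix units.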
