xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

📅 2025-10-02
🤖 AI Summary
Understanding the scaling behavior of extended Long Short-Term Memory (xLSTM) architectures in large language models—particularly regarding computational efficiency, context-length adaptability, and inference latency—remains underexplored compared to Transformer baselines. Method: We conduct controlled, cross-scale experiments across an 80M–7B parameter range using IsoFLOP normalization and parametric scaling law fitting to isolate architectural effects. Contribution/Results: This work provides the first systematic characterization of xLSTM scaling under both compute-optimal and over-trained regimes. We find that xLSTM achieves Transformer-level benchmark performance at the billion-parameter scale; its advantage in modeling long contexts grows markedly with sequence length; and it exhibits more favorable scaling in both training and inference efficiency—especially in high-throughput, long-context deployment scenarios. These findings establish xLSTM as a competitive alternative to Transformers for memory- and latency-sensitive applications requiring extended context handling.

📝 Abstract
Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM's advantage widens as training and inference contexts grow.
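The parametric-fit approach the abstract mentions can be sketched as a Chinchilla-style loss law fitted to (model size, token count, loss) points. Everything below is an illustrative assumption — the functional form is the standard Chinchilla one, and the constants and data are synthetic, not values from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(X, E, A, alpha, B, beta):
    # L(N, D) = E + A/N^alpha + B/D^beta: irreducible loss plus
    # parameter-count and data-count terms (Chinchilla-style form).
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) grid drawn from assumed ground-truth constants,
# spanning roughly the paper's 80M-7B / 2B-2T sweep.
rng = np.random.default_rng(0)
N = np.tile([8e7, 4e8, 1.3e9, 7e9], 4)
D = np.repeat([2e9, 2e10, 2e11, 2e12], 4)
true = dict(E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28)
y = loss_law((N, D), **true) + rng.normal(0.0, 1e-3, N.shape)

# Recover the constants from the noisy observations.
popt, _ = curve_fit(loss_law, (N, D), y,
                    p0=[1.5, 350.0, 0.3, 350.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt
print(f"fitted alpha={alpha:.3f}, beta={beta:.3f}")
```

In the paper's setting the same fit would be run once per architecture, so that xLSTM and Transformer exponents can be compared directly.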
Problem

Research questions and friction points this paper addresses.

Comparing scaling behavior of Transformers and xLSTM architectures
Analyzing optimal model size dependence on context length
Investigating inference-time scaling characteristics for language models
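The IsoFLOP question behind the first two bullets — how the optimal model size moves with compute budget — can be illustrated with a toy sweep: fix a budget C ≈ 6·N·D, vary model size N, and take the minimizer of an assumed parametric loss. The loss constants here are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np

def loss(N, D, E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28):
    # Chinchilla-style parametric loss; constants are assumptions.
    return E + A / N**alpha + B / D**beta

def compute_optimal_N(C, grid=np.logspace(7.5, 10.5, 600)):
    # Along one IsoFLOP slice, C ~ 6*N*D ties token count D to model
    # size N, so sweeping N traces a loss valley with minimum N*(C).
    D = C / (6.0 * grid)
    return grid[np.argmin(loss(grid, D))]

for C in (1e20, 1e21, 1e22):
    print(f"C={C:.0e} FLOPs -> N* ~ {compute_optimal_N(C):.2e} params")
```

Repeating this per context length is what lets the paper ask whether the compute-optimal N shifts when sequences get longer, a dependence it notes was largely ignored in prior work.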
Innovation

Methods, ideas, or system contributions that make the work stand out.

xLSTM runs with time complexity linear in context length
xLSTM remains competitive with Transformers at billion-parameter scale
xLSTM's advantage over Transformers widens as contexts grow
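The linear-complexity claim above can be made concrete with a back-of-the-envelope FLOP count: a recurrent cell like xLSTM does a fixed amount of work per token, while full self-attention adds a term quadratic in sequence length T. The cost model below is a deliberately crude assumption for illustration, not a measurement from the paper:

```python
def recurrent_flops(T, d):
    # Fixed-size state update per token: total work is linear in T.
    return T * d * d

def attention_flops(T, d):
    # Projections (T*d^2) plus pairwise attention scores (T^2*d).
    return T * d * d + T * T * d

d = 1024
for T in (1_024, 8_192, 65_536):
    ratio = attention_flops(T, d) / recurrent_flops(T, d)
    print(f"T={T:>6}: attention/recurrent = {ratio:.1f}x")
# Under this cost model the ratio is 1 + T/d: 2.0x, 9.0x, 65.0x here.
```

The ratio growing as 1 + T/d is why, on pure FLOP grounds, a widening xLSTM advantage at longer training and inference contexts is plausible.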