🤖 AI Summary
Nonlinear RNNs suffer from inherent sequential dependencies that impede parallelization, limiting their scalability in large language models; meanwhile, existing parallel architectures—such as Transformers and state space models (SSMs)—either incur high computational overhead or rely on linear assumptions, hindering effective modeling of strongly nonlinear sequential dynamics. This paper introduces the first efficiently parallelizable nonlinear RNN framework: it reformulates recursive forward propagation as an implicit system of equations and combines Newton’s method with a customized parallel reduction algorithm to enable fully automatic parallelization of standard nonlinear RNNs (e.g., LSTM, GRU). Departing from conventional unrolling-based training, the approach preserves full nonlinear expressivity while enabling scalable distributed training. Experiments demonstrate up to 665× speedup over standard sequential RNN implementations, successful training of a 7B-parameter model, and perplexity competitive with similarly sized Transformers and Mamba2.
📝 Abstract
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, enabling the training of nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
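The core idea — stacking the whole nonlinear recurrence into one system of equations F(h) = 0 and solving it with Newton's method, where each Newton step reduces to a *linear* recurrence that admits a parallel reduction — can be sketched on a toy scalar recurrence. This is a minimal illustration under assumed names (the `tanh` cell, the sequential inner solve), not the paper's CUDA implementation; in ParaRNN the bidiagonal Newton system would be solved with a parallel scan rather than the forward substitution shown here.

```python
import numpy as np

A = 0.5  # fixed recurrence weight for the toy cell

def f(h_prev, x):
    # toy nonlinear RNN cell h_t = tanh(a*h_{t-1} + x_t), a stand-in for LSTM/GRU
    return np.tanh(A * h_prev + x)

def df_dh(h_prev, x):
    # derivative of the cell w.r.t. the previous hidden state
    return A * (1.0 - np.tanh(A * h_prev + x) ** 2)

rng = np.random.default_rng(0)
T = 64
x = rng.normal(size=T)

# Reference: the naive sequential forward pass.
h_seq = np.zeros(T)
prev = 0.0
for t in range(T):
    h_seq[t] = f(prev, x[t])
    prev = h_seq[t]

# Parallel-in-sequence view: solve F(h) = 0 with F_t = h_t - f(h_{t-1}, x_t)
# for ALL timesteps at once via Newton's iterations.
h = np.zeros(T)  # initial guess for the entire state trajectory
for _ in range(T):
    h_prev = np.concatenate(([0.0], h[:-1]))
    F = h - f(h_prev, x)          # residual, evaluated for all t in parallel
    if np.max(np.abs(F)) < 1e-12:
        break
    # The Jacobian is lower bidiagonal: 1 on the diagonal, -df/dh_{t-1} on the
    # subdiagonal. Solving J @ delta = -F is therefore a LINEAR recurrence --
    # the part ParaRNN dispatches to a custom parallel reduction. Here we use
    # plain forward substitution for clarity.
    sub = df_dh(h_prev, x)
    delta = np.zeros(T)
    carry = 0.0
    for t in range(T):
        delta[t] = -F[t] + sub[t] * carry
        carry = delta[t]
    h = h + delta

assert np.allclose(h, h_seq)  # Newton's iterate matches the sequential pass
```

Because the system is lower triangular, each Newton iteration makes at least one additional leading state exact, so convergence is guaranteed within T steps and is typically far faster in practice; the per-iteration work (residual evaluation plus a linear-recurrence solve) is what becomes parallelizable across the sequence.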