AI Summary
This work addresses the vanishing/exploding gradient problem and the limited representational capacity of conventional recurrent neural networks (RNNs) under finite memory. It proposes MinMax Recurrent Neural Cascades (RNCs), a novel recurrent architecture grounded in MinMax algebra, which is introduced here into recurrent networks for the first time. Through a multi-layer cascade design, RNCs support both sequential evaluation and parallel evaluation in logarithmic depth, and their hidden states and activations stay bounded for inputs of arbitrary length. Crucially, the gradient of a state with respect to a past state can stay at one regardless of the time distance between them, and loss gradients exist almost everywhere and remain bounded. Theoretically, RNCs are shown to be capable of expressing all regular languages. Empirically, they solve multiple synthetic tasks perfectly, outperforming the recurrent baselines considered, and a 127-million-parameter model shows competitive next-token prediction performance for its size.
Abstract
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and, most importantly, not affected by vanishing or exploding gradients. We call the models obtained by cascading several layers of neurons that employ such recurrence MinMax Recurrent Neural Cascades (RNCs). We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; we also train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
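The abstract does not spell out the recurrence, so the following is only a toy sketch of how a min/max update could have the listed properties; it is not the paper's architecture. It assumes a hypothetical scalar update h_t = max(b_t, min(a_t, h_{t-1})) with gate values a_t, b_t in [0, 1]. The names `step`, `compose`, `run_sequential`, and `run_tree` are illustrative only. Under that assumption, two consecutive updates collapse into a single update of the same form, and this is what allows a balanced, logarithmic-depth reduction.

```python
import random


def step(h, gate):
    """One assumed min-max update: h_t = max(b_t, min(a_t, h_{t-1}))."""
    a, b = gate
    return max(b, min(a, h))


def run_sequential(h0, gates):
    """Apply the updates one after another (linear depth)."""
    h = h0
    for gate in gates:
        h = step(h, gate)
    return h


def compose(first, second):
    """Return the single gate equivalent to applying `first`, then `second`.

    Follows from distributivity of min over max:
    max(b2, min(a2, max(b1, min(a1, h))))
        = max(max(b2, min(a2, b1)), min(min(a1, a2), h)).
    """
    a1, b1 = first
    a2, b2 = second
    return (min(a1, a2), max(b2, min(a2, b1)))


def run_tree(h0, gates):
    """Combine adjacent gates pairwise; each round could run in parallel,
    so the number of rounds grows logarithmically with the input length."""
    level = list(gates)
    if not level:
        return h0
    while len(level) > 1:
        paired = [compose(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry the unpaired last gate forward, in order
            paired.append(level[-1])
        level = paired
    return step(h0, level[0])


# Random gates in [0, 1]: both evaluation orders agree, and the state stays in [0, 1].
rng = random.Random(0)
gates = [(rng.random(), rng.random()) for _ in range(1000)]
h_seq = run_sequential(0.3, gates)
h_tree = run_tree(0.3, gates)

# Fully open gates (a=1, b=0): the state passes through unchanged, so a
# finite-difference estimate of d h_T / d h_0 is 1 even after 1000 steps.
eps = 1e-6
open_gates = [(1.0, 0.0)] * 1000
slope = (run_sequential(0.5 + eps, open_gates)
         - run_sequential(0.5, open_gates)) / eps
```

In this toy setting each update either passes the previous state through (local derivative one) or replaces it with a gate value (local derivative zero), so the state gradient is never a product of fractions that shrinks with distance. Because min and max only select among their inputs, the sequential and tree evaluations agree exactly rather than up to rounding.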