🤖 AI Summary
Neural networks struggle with length generalization: recurrent models are prone to positional biases, while Transformers are constrained by fixed computational depth. This work proposes MLP-LDRU (Multilayer Perceptron-based Logarithmic-Depth Recurrent Unit), a recurrent unit that approximates recurrence through parallel reduction using operators biased toward associativity. On 21 regular-language tasks, consisting of standard benchmarks and new prefix languages, it achieves 100% out-of-distribution accuracy on 18 tasks and at least 99.9% on the remaining 3 when the maximum training length is increased, outperforming comparable recurrent and attention-based models. It also performs competitively on ListOps and NLP classification benchmarks.
📝 Abstract
Length generalization remains a persistent challenge for neural networks: recurrent models tend to suffer from positional biases, while Transformers are constrained by fixed computational depth. Regular languages are a frequently used testbed for evaluating length generalization, because the correct label can be checked for sequences of any length. We propose MLP-LDRU, a type of Log-Depth Recurrent Unit, which captures a class of associativity-biased operators designed to approximate recurrence through parallel reduction. We evaluate MLP-LDRU on 21 regular-language tasks, consisting of standard benchmarks and new prefix languages, where it achieves 100% out-of-distribution accuracy on 18 tasks and at least 99.9% on the remaining 3 when the maximum training length is increased, outperforming comparable recurrent and attention-based models. We further evaluate MLP-LDRU beyond regular languages on ListOps and NLP classification benchmarks, where it performs competitively.
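The idea of replacing a step-by-step recurrence with a parallel reduction can be illustrated with a minimal sketch. This is not MLP-LDRU itself: the learned, associativity-biased operator is swapped for an exactly associative one (composition of DFA transition functions), and the task is a simple parity language chosen for illustration. The names `tree_reduce`, `compose`, `TOKEN_TO_TRANSITION`, `accepts_parallel`, and `accepts_sequential` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(items: Sequence[T], combine: Callable[[T, T], T]) -> Tuple[T, int]:
    """Combine adjacent pairs level by level; return (result, rounds).

    For an associative `combine`, the result equals a left-to-right fold,
    but the number of rounds grows as ceil(log2(n)) instead of n - 1.
    """
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    level: List[T] = list(items)
    rounds = 0
    while len(level) > 1:
        # All pairs in one level are independent, so they could run in parallel.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element is carried up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


# Illustrative assumption: a 2-state DFA for "even number of 1s".
# Each token is a transition function, stored as a tuple state -> next state.
Transition = Tuple[int, ...]
TOKEN_TO_TRANSITION = {"0": (0, 1), "1": (1, 0)}


def compose(first: Transition, second: Transition) -> Transition:
    """Apply `first`, then `second`. Function composition is associative."""
    return tuple(second[s] for s in first)


def accepts_parallel(word: str) -> bool:
    """Decide membership with a log-depth reduction over transitions."""
    if not word:
        return True  # the empty word contains zero 1s
    final, _ = tree_reduce([TOKEN_TO_TRANSITION[ch] for ch in word], compose)
    return final[0] == 0  # start in state 0, accept if we end in state 0


def accepts_sequential(word: str) -> bool:
    """Reference implementation: ordinary left-to-right recurrence."""
    state = 0
    for ch in word:
        state = TOKEN_TO_TRANSITION[ch][state]
    return state == 0
```

Because the combining operator here is exactly associative, the tree-shaped reduction and the sequential scan agree at every length, which is the property a learned operator would need to approximate in order to extrapolate beyond its training lengths.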