🤖 AI Summary
Despite theoretical support for infinite context, state space models (SSMs), linear RNNs, and other sequence architectures exhibit substantially degraded performance on ultra-long sequences in practice, with large inter-architectural disparities in extrapolation capability.
Method: We conduct the first systematic empirical evaluation of SSMs, linear RNNs, and Transformer variants across controlled synthetic tasks and real-world long-text benchmarks, analyzing their context scaling behavior and generalization curves.
Results: All models suffer sharp performance drops beyond certain sequence lengths, indicating a fundamental gap between theoretical infinite-context capacity and empirical efficacy. Crucially, inductive bias—not parameter count or training scale—emerges as the dominant factor governing practical long-range modeling effectiveness. Our findings challenge prevailing assumptions about asymptotic context scalability and provide attributable, evidence-based insights into the failure mechanisms of long-range dependency modeling.
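The evaluation protocol described above measures accuracy as a function of sequence length, revealing where extrapolation breaks down. A minimal sketch of such a probe is below; the synthetic copy task, the chosen lengths, and the stand-in model are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical length-extrapolation probe: generate a synthetic task at
# several sequence lengths and record per-length accuracy. In the real
# study, `model` would be a trained SSM / linear RNN / Transformer.
import random

def make_copy_example(length, vocab=8, rng=random):
    # Synthetic "copy" task: the target is the input sequence itself,
    # so exact-match accuracy tests whether state carries information
    # across the full length.
    seq = [rng.randrange(vocab) for _ in range(length)]
    return seq, list(seq)

def evaluate(model, lengths, n_examples=32, seed=0):
    # Returns {length: exact-match accuracy}, i.e. the generalization
    # curve over increasing sequence lengths.
    rng = random.Random(seed)
    results = {}
    for L in lengths:
        correct = 0
        for _ in range(n_examples):
            x, y = make_copy_example(L, rng=rng)
            correct += int(model(x) == y)
        results[L] = correct / n_examples
    return results

# Trivial oracle stand-in for a trained model (always copies perfectly);
# a real model's curve would degrade beyond its training length.
identity_model = lambda x: list(x)

curve = evaluate(identity_model, lengths=[64, 256, 1024])
```

A trained model evaluated this way typically shows the sharp drop the summary describes: near-perfect accuracy up to (roughly) the training length, then rapid degradation at longer lengths.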
📝 Abstract
Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, in both systems engineering and model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation showing that while such claims may be sound theoretically, large practical gaps remain when observed empirically. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to study such paradigms further and to investigate why long-context models seemingly fail to behave as one might expect.