🤖 AI Summary
This work addresses the high computational cost that slows iteration when evaluating sequence model architectures. It introduces CogScale, a lightweight synthetic benchmark for evaluating specific cognitive capabilities at multiple scales under fixed parameter budgets. CogScale comprises 14 scalable tasks that assess models' memory and reasoning capacities under budgets of 1K, 10K, and 100K parameters. Systematic experiments across seven representative architectures (GRU, LSTM, xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder) show that classical RNNs and ESNs excel at basic memory tasks under tight parameter constraints, whereas only attention-based and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty increase. The framework is intended to let researchers validate architectural ideas cheaply before committing to large-scale training, and it offers a standardized tool for sequence modeling research.
📝 Abstract
The ability to maintain and manipulate information over time is fundamental to both living beings and artificial intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1K, 10K, and 100K parameters) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.
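
To make the idea of a strict parameter budget concrete, the sketch below counts the trainable parameters of a single-layer GRU with a linear readout and finds the largest hidden size that fits each budget. It is only an illustration: the abstract does not say how CogScale sizes its models, the helper names `gru_param_count` and `largest_hidden_size` are hypothetical, and the input and output widths of 8 are arbitrary placeholders. The count assumes the common convention of three gates, each with input weights, recurrent weights, and two bias vectors.

```python
def gru_param_count(input_size: int, hidden_size: int, output_size: int) -> int:
    """Trainable parameters of a 1-layer GRU followed by a linear readout.

    Assumed convention: 3 gates, each with an input weight matrix,
    a recurrent weight matrix, and two bias vectors.
    """
    recurrent = 3 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors per gate
    )
    readout = hidden_size * output_size + output_size  # linear layer + bias
    return recurrent + readout


def largest_hidden_size(budget: int, input_size: int, output_size: int) -> int:
    """Largest hidden size whose parameter count stays within the budget."""
    hidden = 0
    while gru_param_count(input_size, hidden + 1, output_size) <= budget:
        hidden += 1
    return hidden


if __name__ == "__main__":
    # Budgets from the abstract; input/output widths are placeholders.
    for budget in (1_000, 10_000, 100_000):
        h = largest_hidden_size(budget, input_size=8, output_size=8)
        used = gru_param_count(8, h, 8)
        print(f"budget={budget:>7}  hidden_size={h:>4}  parameters={used}")
```

The same search works for any architecture once its parameter count is written as a function of width, which is one plausible way to compare models of very different structure at a matched size.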