🤖 AI Summary
This work addresses the lack of interpretable, optimization-friendly quantitative metrics for memory utilization in sequence models. It proposes effective state-size (ESS), a metric grounded in control theory and signal processing that quantifies how much historical information a model actually stores and retrieves, for the broad class of systems built on input-invariant and input-varying linear operators. ESS moves beyond coarse-grained proxies such as attention maps or cache size: by combining linear systems theory, spectral analysis, and gradient-driven estimation, it enables unified, cross-architectural comparison of attention-based, convolutional, and recurrent models. Experiments show that ESS guides initialization, informs memory-aware regularizer design, improves knowledge distillation efficiency, and reveals architecture-dependent memory responses to context delimiters in large language models, advancing the performance-efficiency frontier of sequence modeling.
📝 Abstract
The need to develop a general framework for architecture analysis is becoming increasingly important, given the expanding design space of sequence models. To this end, we draw insights from classical signal processing and control theory to develop a quantitative measure of *memory utilization*: the internal mechanisms through which a model stores past information to produce future outputs. This metric, which we call ***effective state-size*** (ESS), is tailored to the fundamental class of systems with *input-invariant* and *input-varying linear operators*, encompassing a variety of computational units such as variants of attention, convolutions, and recurrences. Unlike prior work on memory utilization, which either relies on raw operator visualizations (e.g., attention maps) or simply the total *memory capacity* (i.e., cache size) of a model, our metrics provide highly interpretable and actionable measurements. In particular, we show how ESS can be leveraged to improve initialization strategies, inform novel regularizers, and advance the performance-efficiency frontier through model distillation. Furthermore, we demonstrate that the effect of context delimiters (such as end-of-speech tokens) on ESS highlights cross-architectural differences in how large language models utilize their available memory to recall information. Overall, we find that ESS provides valuable insights into the dynamics that dictate memory utilization, enabling the design of more efficient and effective sequence models.
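To make the core idea concrete: for a causal linear operator (e.g., a post-softmax attention matrix), the sub-block that maps past inputs to future outputs determines how much of the past the model must carry forward, and its rank gives an ESS-style quantity. The following NumPy sketch illustrates this intuition only; the function name, the plain `matrix_rank` estimator, and the tolerance are our own illustrative choices, not the paper's exact estimator.

```python
import numpy as np

def effective_state_size(T, tol=1e-8):
    """Per-position ESS-style quantity for a causal linear operator.

    T is a (seq_len x seq_len) lower-triangular matrix mapping inputs
    to outputs. At each split point i, the block T[i:, :i] maps past
    inputs to future outputs; its rank bounds how many state dimensions
    are actually needed to carry the past across position i.
    """
    T = np.asarray(T)
    n = T.shape[0]
    ess = np.zeros(n, dtype=int)
    for i in range(1, n):
        # Rank of the past -> future sub-operator at split point i.
        ess[i] = np.linalg.matrix_rank(T[i:, :i], tol=tol)
    return ess

# Example: a random causal (lower-triangular) operator on 6 positions.
rng = np.random.default_rng(0)
A = np.tril(rng.random((6, 6)))
print(effective_state_size(A))
```

Two sanity checks match the intuition: a diagonal operator (outputs depend only on the current input) yields ESS of zero everywhere, while a dense causal averaging operator with identical rows of ones yields ESS of one, since a single running summary of the past suffices.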