🤖 AI Summary
This work asks how to allocate a fixed parameter budget among neuron count, per-neuron complexity, and per-neuron connectivity so as to maximize sequence-modeling performance. To this end, the authors propose a recurrent architecture in which these three dimensions can be tuned independently, built from a new Expressive Leaky Memory (ELM) neuron model. Using hyperparameter sweeps on the SHD-Adding and Enwik8 benchmarks together with a closed-form information-theoretic model, they find a non-trivial optimum in the trade-off between neuron complexity and network scale. Performance improves monotonically when each dimension is increased individually, and under larger parameter budgets the most effective strategy is to increase both the number of neurons and the complexity of each neuron. The sweep traces a near-Pareto-frontier scaling law consistent with the information-theoretic model.
📝 Abstract
Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question: how should one split a fixed parameter budget $P$ among the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$, and what controls the optimal allocation? Answering this calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows $N$, $k_e$, and $k_c$ to be adjusted individually and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more neurons and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at the two extremes to per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.
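To make the fixed-budget question concrete, the following minimal sketch enumerates candidate allocations $(N, k_e, k_c)$ under a budget $P$. The cost model is an assumption made purely for illustration: it charges each unit `k_e + k_c` parameters, which is not the ELM Network's actual parameter count (the abstract does not state one). The function names `params_per_unit` and `allocations` are likewise hypothetical.

```python
from itertools import product


def params_per_unit(k_e: int, k_c: int) -> int:
    # ASSUMPTION: toy cost model in which a unit costs k_e + k_c parameters.
    # The real ELM Network parameterization is not given in the abstract.
    return k_e + k_c


def allocations(budget: int, k_e_options, k_c_options):
    """For each (k_e, k_c) pair, return the largest N that fits the budget."""
    feasible = []
    for k_e, k_c in product(k_e_options, k_c_options):
        n = budget // params_per_unit(k_e, k_c)
        if n >= 1:  # skip pairs where even a single unit exceeds the budget
            feasible.append((n, k_e, k_c))
    return feasible


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for n, k_e, k_c in allocations(10_000, [8, 64, 512], [4, 32]):
        used = n * params_per_unit(k_e, k_c)
        print(f"N={n:5d}  k_e={k_e:4d}  k_c={k_c:3d}  params used={used}")
```

Under this toy model, every row uses roughly the same budget but trades many simple units for few complex ones. A sweep like the one described above would train a network at each such point and compare the resulting performance.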