🤖 AI Summary
Models that forecast neural population activity are usually evaluated with a single aggregate Pearson correlation, which hides structure in the temporal, spatial, and magnitude dimensions of the predictions. To address this limitation, this work introduces SpikeProphecy, which the authors describe as the first large-scale benchmark for causal, autoregressive spike-count forecasting. It is built on 105 Neuropixels sessions (approximately 89,800 neurons) and evaluates seven baseline architectures across four structural families: state-space models (SSMs), a Transformer, an LSTM, and a spiking neural network (SNN). The evaluation framework decomposes predictive performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition reveals a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing statistics (region ΔR² = 0.018 above the covariates), exposes a sub-Poisson evaluation floor, and yields a negative result for KL-on-output-rates distillation from ANNs to SNNs.
📝 Abstract
Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $ΔR^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.
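To illustrate why a single aggregate $r$ can mask structure, here is a minimal sketch of one way such a decomposition could be computed. The abstract does not give the exact definitions, so the three components below are assumptions chosen for illustration: temporal fidelity as the mean per-neuron correlation across time bins, spatial pattern accuracy as the mean per-bin correlation across neurons, and magnitude-invariant alignment as the mean per-bin cosine similarity. The function names (`pearson`, `cosine`, `decompose`) and the toy data are hypothetical and are not taken from SpikeProphecy.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def cosine(x, y):
    """Cosine similarity (insensitive to overall scale); NaN if either vector is zero."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return float("nan")
    return dot / (nx * ny)

def nanmean(values):
    """Mean over the non-NaN entries; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

def decompose(pred, actual):
    """pred, actual: lists of time bins, each a list of per-neuron spike counts.

    The component definitions are assumed for illustration, not the paper's.
    """
    n_bins, n_neurons = len(actual), len(actual[0])
    flat_pred = [v for row in pred for v in row]
    flat_actual = [v for row in actual for v in row]
    return {
        # One scalar over all (bin, neuron) pairs: the conventional metric.
        "aggregate": pearson(flat_pred, flat_actual),
        # Per neuron, correlate its predicted and actual time courses.
        "temporal": nanmean([
            pearson([pred[t][i] for t in range(n_bins)],
                    [actual[t][i] for t in range(n_bins)])
            for i in range(n_neurons)
        ]),
        # Per time bin, correlate the predicted and actual population patterns.
        "spatial": nanmean([pearson(pred[t], actual[t]) for t in range(n_bins)]),
        # Per time bin, compare pattern direction regardless of overall scale.
        "alignment": nanmean([cosine(pred[t], actual[t]) for t in range(n_bins)]),
    }

# Toy data (invented): three neurons with very different mean rates, four time bins.
actual = [[1, 10, 20], [3, 12, 22], [1, 10, 20], [3, 12, 22]]
# The prediction gets every neuron's mean rate right but its time course backwards.
pred = [[3, 12, 22], [1, 10, 20], [3, 12, 22], [1, 10, 20]]

scores = decompose(pred, actual)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the aggregate correlation stays high (about 0.97) because it is dominated by differences in mean firing rate between neurons, while the temporal component is −1: the forecast is exactly anti-correlated with each neuron's actual time course. This is the kind of structure an aggregate scalar collapses together.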