🤖 AI Summary
This work addresses the challenge of interpretable modeling for high-dimensional spatiotemporal data in cellular imaging by proposing a novel architecture that integrates an ℓ₁-regularized vector autoregressive (VAR) model into a convolutional autoencoder. A skip connection decouples static spatial features, so the VAR component can focus exclusively on sparse dynamic modeling. The framework enables, for the first time, end-to-end differentiable joint training of the ℓ₁-regularized VAR and the autoencoder. This approach achieves dimensionality reduction while preserving an interpretable sparse temporal structure, facilitates statistical testing on individual observational units, and generates spatial contribution maps that highlight regions driving the dynamics. Evaluated on two-photon calcium imaging data, the method outperforms non-jointly trained baselines in accurately identifying key spatiotemporal regions, effectively balancing representational capacity with statistical interpretability.
📝 Abstract
While artificial neural networks excel in unsupervised learning of non-sparse structure, classical statistical regression techniques offer better interpretability, in particular when sparsity is enforced by $\ell_1$ regularization, enabling identification of which factors drive observed dynamics. We investigate how these two types of approaches can be optimally combined, considering as an example two-photon calcium imaging data from which sparse autoregressive dynamics are to be extracted. We propose embedding a vector autoregressive (VAR) model as an interpretable regression technique into a convolutional autoencoder, which provides dimension reduction for tractable temporal modeling. A skip connection separately handles non-sparse static spatial information, selectively channeling sparse structure into the $\ell_1$-regularized VAR. $\ell_1$-estimation of the regression parameters is enabled by differentiating through the piecewise-linear solution path. This is contrasted with approaches where the autoencoder does not adapt to the VAR model. Having an embedded statistical model also enables a testing approach for comparing temporal sequences from the same observational unit. Additionally, contribution maps visualize which spatial regions drive the learned dynamics.
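To make the core statistical component concrete, the following is a minimal sketch of fitting a sparse VAR(p) model to a latent time series via coordinate-descent lasso on the stacked lagged design matrix. This is a standalone illustration of $\ell_1$-regularized VAR estimation, not the paper's method: the paper instead differentiates through the piecewise-linear lasso solution path inside a jointly trained autoencoder, and the function name and regularization scale here are hypothetical.

```python
import numpy as np

def var_lasso(z, p=1, lam=5.0, n_iter=500):
    """Sparse VAR(p) fit for a latent series z of shape (T, d).

    Minimizes 0.5 * ||Y - X B||_F^2 + lam * ||B||_1 by cyclic
    coordinate descent, where X stacks the p lagged observations
    and B holds the (p*d, d) autoregressive coefficients.
    Illustrative only; not the paper's path-differentiation scheme.
    """
    T, d = z.shape
    # Design matrix: row t holds [z[t-1], ..., z[t-p]]; target is z[t].
    X = np.hstack([z[p - k - 1 : T - k - 1] for k in range(p)])  # (T-p, p*d)
    Y = z[p:]                                                    # (T-p, d)
    B = np.zeros((p * d, d))
    col_sq = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iter):
        for j in range(p * d):
            # Partial residual with coefficient row j removed.
            r = Y - X @ B + np.outer(X[:, j], B[j])
            rho = X[:, j] @ r
            # Soft-threshold update, elementwise over the d outputs.
            B[j] = np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0) / col_sq[j]
    return B
```

The soft-thresholding step is what produces exact zeros in `B`, so the fitted coefficient matrix directly exposes which latent dimensions drive which others, which is the interpretability property the abstract refers to.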