🤖 AI Summary
This study decodes visual information from high-density neural recordings in the primate ventral stream to investigate how neural activity underpins perception. We systematically evaluate the impact of model architecture, training objectives, and data scale on decoding performance, and propose an efficient decoder that combines a lightweight temporal attention module with a shallow multilayer perceptron. We further introduce a generative framework that integrates low-resolution image reconstruction with semantically conditioned diffusion. Experiments demonstrate that our approach achieves up to 70% top-1 accuracy on image retrieval, substantially outperforming existing methods. Our findings also reveal diminishing returns with increasing input dimensionality and dataset size, underscoring the central role of temporal dynamics in visual neural decoding.
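As a concrete illustration, the sketch below shows one way such a decoder could be structured in PyTorch: a learned per-time-bin attention score pools binned spike counts into a single population vector, which a shallow MLP maps into an image-embedding space for retrieval. All module names, layer sizes, and tensor shapes here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a temporal-attention + shallow-MLP decoder.
# Hypothetical names and dimensions; not the paper's actual code.
import torch
import torch.nn as nn

class TemporalAttentionDecoder(nn.Module):
    def __init__(self, n_channels: int, embed_dim: int = 512):
        super().__init__()
        # One attention score per time bin, computed from population activity.
        self.attn = nn.Linear(n_channels, 1)
        # Shallow MLP mapping the pooled population vector to an
        # image-embedding space used for retrieval.
        self.mlp = nn.Sequential(
            nn.Linear(n_channels, 1024),
            nn.GELU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_bins, n_channels) binned spike counts over ~200 ms
        w = torch.softmax(self.attn(x), dim=1)   # (batch, n_bins, 1)
        pooled = (w * x).sum(dim=1)              # attention-weighted temporal pooling
        return self.mlp(pooled)                  # (batch, embed_dim)

decoder = TemporalAttentionDecoder(n_channels=1024)
z = decoder(torch.randn(8, 20, 1024))  # 8 trials, 20 time bins, 1024 channels
```

The softmax over time bins gives the model a learned, content-dependent temporal weighting, which is the property the study identifies as the main driver of decoding accuracy.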
📄 Abstract
Understanding how neural activity gives rise to perception is a central challenge in neuroscience. We address the problem of decoding visual information from high-density intracortical recordings in primates, using the THINGS Ventral Stream Spiking Dataset. We systematically evaluate the effects of model architecture, training objectives, and data scaling on decoding performance. Results show that decoding accuracy is driven mainly by how well temporal dynamics in the neural signal are modeled, rather than by architectural complexity. A simple model combining temporal attention with a shallow MLP achieves up to 70% top-1 image retrieval accuracy, outperforming linear baselines as well as recurrent and convolutional approaches. Scaling analyses reveal predictable diminishing returns with increasing input dimensionality and dataset size. Building on these findings, we design a modular generative decoding pipeline that combines low-resolution latent reconstruction with semantically conditioned diffusion, generating plausible images from 200 ms of brain activity. This framework offers design principles for brain-computer interfaces and semantic neural decoding.
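The two-stage generative pipeline can likewise be sketched under stated assumptions: a low-resolution image decoded from neural features initializes a semantically conditioned diffusion step, here stood in for by the Hugging Face diffusers img2img pipeline with a text prompt in place of the paper's decoded semantic embeddings. The neural readouts below are untrained placeholders, not the authors' models.

```python
# Minimal sketch of the two-stage generative pipeline, under stated
# assumptions: the linear readout is an untrained placeholder, and a text
# prompt stands in for the paper's decoded semantic conditioning.
import torch
import torch.nn as nn
from diffusers import StableDiffusionImg2ImgPipeline
from torchvision.transforms.functional import to_pil_image

neural = torch.randn(1, 1024)  # pooled features from 200 ms of activity

# Stage 1: decode a coarse, low-resolution image estimate (placeholder readout).
lowres_head = nn.Linear(1024, 3 * 64 * 64)
lowres = lowres_head(neural).sigmoid().view(3, 64, 64)
init_image = to_pil_image(lowres).resize((512, 512))

# Stage 2: refine the blurry estimate with semantically conditioned diffusion.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # any img2img-capable checkpoint
    torch_dtype=torch.float16,
).to("cuda")
out = pipe(
    prompt="a natural image of the decoded object category",  # semantic condition
    image=init_image,
    strength=0.75,        # how far diffusion may depart from the low-res estimate
    guidance_scale=7.5,
).images[0]
out.save("reconstruction.png")
```

In this sketch, the `strength` parameter controls the trade-off between fidelity to the low-resolution reconstruction and the semantic prior supplied by the diffusion model.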