🤖 AI Summary
Bridging biological plausibility with Transformer computation remains a fundamental challenge in computational neuroscience and AI.
Method: We establish a mechanistic, differentiable mapping between cortico-thalamic circuits and multi-head self-attention (MHSA) through computational neuroscience modeling, circuit-level equivalence analysis, and analytical derivation of gradients for linear MHSA.
Contribution/Results: We present the first mathematically equivalent cortico-thalamic circuit model of MHSA; propose a functional division hypothesis in which superficial and deep pyramidal neurons within cortical microcolumns encode attention masks and attention-modulated values, respectively; and validate the model across scales against electrophysiological and anatomical data, demonstrating close structural–functional correspondence and yielding analytically tractable, learnable gradients under a token-wise MSE loss. This work bridges biologically realistic neural circuits and Transformer mechanisms through a testable, mechanistic, and differentiable framework.
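The functional division hypothesis can be made concrete with a minimal sketch of one attention head, annotating which computation the paper assigns to each cell population. The code below is an illustration of standard self-attention with those annotations; the circuit mapping, not this implementation, is the paper's claim, and the weight shapes and random inputs are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One head of self-attention. Per the hypothesis:
    superficial pyramidal cells ~ the attention mask A,
    deep pyramidal cells ~ the attention-modulated values A @ V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Attention mask (proposed role: superficial pyramidal cells)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Attention-modulated values (proposed role: deep pyramidal cells)
    return A, A @ V

# Toy example: 4 tokens, 8-dimensional embeddings (arbitrary sizes)
rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A, Y = attention_head(X, Wq, Wk, Wv)
print(Y.shape)  # (4, 8)
```

Each row of `A` is a distribution over tokens, so the "superficial" output is a normalized mask applied row-wise to the "deep" value computation.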
📝 Abstract
Both biological cortico-thalamic networks and artificial transformer networks use canonical computations to perform a wide range of cognitive tasks. In this work, we propose that the structure of cortico-thalamic circuits is well suited to realize a computation analogous to multihead self-attention, the main algorithmic innovation of transformers. We start from the concept of a cortical unit module, or microcolumn, and propose that superficial and deep pyramidal cells carry distinct computational roles. Specifically, superficial pyramidal cells encode an attention mask applied to deep pyramidal cells to compute attention-modulated values. We show how to wire such microcolumns into a circuit equivalent to a single head of self-attention. We then suggest a parallel between one head of attention and a cortical area, and on this basis show how to wire cortico-thalamic circuits to perform multihead self-attention. Throughout these constructions, we compare against existing experimental data and find notable correspondence. Finally, as a first step toward a mechanistic theory of synaptic learning in this framework, we derive formal gradients of a tokenwise mean squared error loss for a multihead linear self-attention block.
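For the final step, the shape of such gradients can be sketched for a single linear (softmax-free) attention head; the paper's own derivation covers the full multihead block and may differ in scaling and projection terms. Writing $Q = XW_Q$, $K = XW_K$, $V = XW_V$, $A = QK^\top$, and $Y = AV$, with tokenwise MSE loss $\mathcal{L} = \tfrac{1}{T}\sum_t \lVert y_t - \hat{y}_t \rVert^2$ and upstream gradient $G := \partial \mathcal{L}/\partial Y = \tfrac{2}{T}(Y - \hat{Y})$, the chain rule gives

\[
\frac{\partial \mathcal{L}}{\partial W_V} = X^\top A^\top G,
\qquad
\frac{\partial \mathcal{L}}{\partial W_Q} = X^\top G V^\top K,
\qquad
\frac{\partial \mathcal{L}}{\partial W_K} = X^\top V G^\top Q.
\]

Each expression is a product of locally available activations and the error signal, which is what makes the learning rule analytically tractable.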