Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current neural encoding models are largely black boxes: they capture correlations between visual stimuli and brain activity but do not explain which image features drive a response. This work proposes MINE (Mechanistically Interpretable Neural Encoding), a framework that brings mechanistic interpretability to human visual encoding by combining language-aligned image representations with counterfactual image editing. It identifies the semantic features that drive each voxel's activation and generalizes them into per-voxel functional selectivity profiles. MINE also supports causal validation: images generated from its feature descriptions elicit voxel responses that match the responses to the original images, and inserting or removing the identified features shifts activation in the expected direction. Applied to well-studied category-selective regions, the method recovers their known categorical preferences and reveals fine-grained functional heterogeneity among the voxels within each region.

📝 Abstract

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show that they are sufficient to generate images that elicit voxel responses matching the responses to the original images, and that they do so more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing that it recovers their known categorical preferences while revealing fine-grained, voxel-specific structure within each region. Overall, our results establish mechanistic interpretability as a path to discovering and causally validating fine-grained hypotheses about neural function.
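
The pipeline described in the abstract (predict voxel responses from image embeddings, attribute each response to semantic features, then test those features by counterfactual editing) can be illustrated with a toy example. The sketch below is not the MINE implementation. It assumes a linear ridge readout, synthetic random embeddings in place of language-aligned image representations, and orthonormal "concept" directions in place of semantic features. Counterfactual image editing is replaced by adding or subtracting a concept direction in embedding space. All names (`fit_ridge`, `concept_attribution`, `edit_embedding`) are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression; returns weights of shape (dim, n_voxels)."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(dim), X.T @ Y)

def concept_attribution(x, w, concepts):
    """Per-concept contribution to one voxel's predicted response for one image.

    Assumes the rows of `concepts` are orthonormal directions in embedding space.
    """
    return (concepts @ x) * (concepts @ w)

def edit_embedding(x, concept, amount):
    """Stand-in for counterfactual editing: insert (amount > 0) or remove (amount < 0) a concept."""
    return x + amount * concept

# Synthetic stand-ins for image embeddings and voxel responses (not real data).
rng = np.random.default_rng(0)
n_images, dim, n_voxels = 200, 16, 3
concepts = np.eye(dim)[:4]            # four hypothetical concept directions
W_true = np.zeros((dim, n_voxels))
W_true[0, 0], W_true[1, 1], W_true[2, 2] = 2.0, 1.5, -1.0
X = rng.normal(size=(n_images, dim))
Y = X @ W_true + 0.1 * rng.normal(size=(n_images, n_voxels))

W = fit_ridge(X, Y)
w0 = W[:, 0]                          # fitted encoding weights for voxel 0
x = X[0]                              # embedding of one "image"

per_image = concept_attribution(x, w0, concepts)  # per-image feature attribution
profile = concepts @ w0                           # per-voxel selectivity over concepts
top_concept = int(np.argmax(np.abs(profile)))

# Counterfactual check: inserting or removing the top concept should shift
# the predicted response in the direction given by the sign of its profile weight.
base = x @ w0
inserted = edit_embedding(x, concepts[top_concept], +1.0) @ w0
removed = edit_embedding(x, concepts[top_concept], -1.0) @ w0
print(top_concept, bool(inserted > base > removed))
```

In this toy setup voxel 0 is constructed to respond positively to the first concept, so inserting that concept raises the predicted response and removing it lowers the response. For a voxel with a negative profile weight, the expected direction of the shift would be reversed.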

Problem

Research questions and friction points this paper is trying to address.

neural encoding

visual cortex

feature selectivity

mechanistic interpretability

voxel-level activity

Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic interpretability

neural encoding

functional selectivity