🤖 AI Summary
This work addresses the substantial inter-subject variability in neural representations, which has so far required training or fine-tuning a separate model for each subject in cross-subject brain decoding. The authors propose a meta-optimized in-context learning approach for semantic visual decoding from fMRI: given only a small set of image-brain activation pairs from a new subject, the model infers that subject's neural encoding patterns without fine-tuning, anatomical alignment, or stimulus overlap. Decoding proceeds by hierarchical inference that inverts the encoder. The model first estimates per-voxel encoder parameters from a context of multiple stimuli and responses, then builds a second context of encoder parameters and response values across voxels to perform aggregated functional inversion. The authors report strong cross-subject and cross-scanner generalization across diverse visual backbones, and present the work as a step towards a generalizable foundation model for non-invasive brain decoding.
📝 Abstract
Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.
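The two-stage inference described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the meta-optimized in-context learner is replaced here by closed-form ridge regression, and the linear encoders, the synthetic data, and every name and size (`estimate_encoders`, `invert_encoders`, `n_context`, and so on) are assumptions made purely for illustration.

```python
import numpy as np


def estimate_encoders(context_feats, context_resps, ridge=1.0):
    """Stage 1 stand-in: fit one linear encoder per voxel from context pairs.

    context_feats: (n_context, d) image features from some visual backbone.
    context_resps: (n_context, n_voxels) measured responses to those images.
    Returns weights of shape (d, n_voxels), one column per voxel.
    Ridge regression is an assumption; the paper infers these in context.
    """
    d = context_feats.shape[1]
    gram = context_feats.T @ context_feats + ridge * np.eye(d)
    return np.linalg.solve(gram, context_feats.T @ context_resps)


def invert_encoders(weights, response, ridge=1.0):
    """Stage 2 stand-in: aggregate over voxels to recover an image embedding.

    Solves min_z ||weights.T @ z - response||^2 + ridge * ||z||^2.
    weights: (d, n_voxels), response: (n_voxels,). Returns z of shape (d,).
    """
    d = weights.shape[0]
    gram = weights @ weights.T + ridge * np.eye(d)
    return np.linalg.solve(gram, weights @ response)


# Synthetic "new subject" (hypothetical sizes, no real fMRI data).
rng = np.random.default_rng(0)
d, n_voxels, n_context = 16, 500, 200
true_w = rng.normal(size=(d, n_voxels))

# Context set: a few image-feature / response pairs from the new subject.
context_feats = rng.normal(size=(n_context, d))
context_resps = context_feats @ true_w + 0.1 * rng.normal(size=(n_context, n_voxels))

# Stage 1: per-voxel encoder parameters from stimuli and responses.
w_hat = estimate_encoders(context_feats, context_resps)

# Stage 2: invert the estimated encoders on a response to an unseen image.
new_feat = rng.normal(size=d)
new_resp = new_feat @ true_w + 0.1 * rng.normal(size=n_voxels)
decoded = invert_encoders(w_hat, new_resp)

cosine = decoded @ new_feat / (np.linalg.norm(decoded) * np.linalg.norm(new_feat))
print(f"cosine similarity between decoded and true embedding: {cosine:.3f}")
```

The sketch only mirrors the structure of the pipeline: a context over stimuli and responses yields per-voxel encoder parameters, and a second step pools those parameters with the response values across voxels to recover the stimulus embedding. No subject-specific gradient updates are involved in either step.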