🤖 AI Summary
Existing visual decoding models reconstruct externally viewed images well, but they do not reliably generalize to reconstructing internally generated mental imagery. This work proposes MIRAGE, a method designed explicitly for mental image reconstruction, which uses a linear backbone and multimodal text and image features to guide a diffusion model in cross-domain reconstruction from fMRI signals. Trained exclusively on data from external visual stimuli, MIRAGE decodes mental images and achieves state-of-the-art performance on the NSD-Imagery benchmark, as measured by feature metrics and human raters. Ablation analysis shows that reconstruction works best when image features have relatively few dimensions and when guidance combines text-based features with both high- and low-level image-based features.
📝 Abstract
To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders perform quite well on mental image reconstruction, others fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. Ablation analysis shows that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that, given the right architecture, existing large-scale datasets using external stimuli are viable training data for decoding mental images, and it warrants optimism about the future success and utility of mental image reconstruction.
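The abstract says only that MIRAGE uses a linear backbone and trains on seen-image data before cross-decoding mental images. As a rough illustration of that idea, the sketch below fits a ridge regression from voxel responses to a small feature vector on synthetic "seen" data, then applies the same weights to synthetic "imagery" data. Everything here is an assumption made for illustration, not the authors' implementation: the choice of ridge regression, the array sizes, the synthetic data, and the names `fit_ridge` and `predict`.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_voxels = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_voxels)
    return np.linalg.solve(gram, X.T @ Y)

def predict(X, W):
    """Apply the learned linear map to new voxel responses."""
    return X @ W

# Synthetic stand-ins (hypothetical sizes): rows are trials, columns are voxels
# or feature dimensions. Real feature targets would come from a text or image
# encoder, which the abstract does not name.
rng = np.random.default_rng(0)
n_train, n_voxels, n_features = 200, 50, 8
W_true = rng.normal(size=(n_voxels, n_features))
X_seen = rng.normal(size=(n_train, n_voxels))
Y_seen = X_seen @ W_true + 0.1 * rng.normal(size=(n_train, n_features))

# Train on "seen image" trials only.
W = fit_ridge(X_seen, Y_seen, alpha=1.0)

# Cross-decode: reuse the same weights on "imagery" trials without retraining.
X_imagery = rng.normal(size=(5, n_voxels))
Y_pred = predict(X_imagery, W)
```

In a full pipeline the predicted feature vectors would then condition a diffusion model. That step is omitted here because the abstract gives no detail on it.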