🤖 AI Summary
This work addresses semantic misalignment in fMRI-based image reconstruction, where salient objects in the reconstructed image are often replaced or hallucinated despite high visual quality. To mitigate this, the study introduces explicit semantic parsing into fMRI decoding, leveraging a grounded vision-language model to generate multi-granularity textual descriptions that disentangle semantic content from visual appearance. These structured descriptions then condition Stable Diffusion 1.4, yielding high-fidelity reconstructions that are semantically aligned with the viewed stimuli. The proposed method, SynMind, outperforms current state-of-the-art approaches across most quantitative metrics, and human evaluations confirm that its reconstructions better reflect perceived content. Notably, the framework runs on a single consumer-grade GPU, demonstrating both practicality and efficiency.
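The summary above describes a two-stage pipeline: decode fMRI activity into text, then condition Stable Diffusion 1.4 on that text. The sketch below illustrates this flow under stated assumptions only: `fmri_to_captions` is a hypothetical placeholder for the grounded-VLM decoding stage (its real interface is not given here), and the voxel dimension is illustrative, not taken from the paper.

```python
# Minimal sketch of the described decode-then-generate pipeline.
# Assumptions: fmri_to_captions is a hypothetical stand-in for the
# grounded-VLM stage; voxel count is illustrative.
import torch
from diffusers import StableDiffusionPipeline

def fmri_to_captions(fmri_voxels: torch.Tensor) -> str:
    # Placeholder for the fMRI -> multi-granularity caption decoder;
    # here it just returns a fixed illustrative caption.
    return "a brown dog lying on a sofa next to a window"

# Stable Diffusion 1.4 is the generator named in the summary.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

fmri_voxels = torch.randn(1, 15724)  # illustrative flattened voxel vector
caption = fmri_to_captions(fmri_voxels)
image = pipe(prompt=caption, num_inference_steps=50).images[0]
image.save("reconstruction.png")
```

The point of the sketch is the division of labor: all semantic reasoning happens in the text stage, so the image generator can stay small (SD 1.4 rather than SDXL) and fit on a single consumer GPU.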
📝 Abstract
Recent advances in fMRI-based image reconstruction have achieved remarkable photo-realistic fidelity. Yet a persistent limitation remains: while reconstructed images often appear naturalistic and holistically similar to the target stimuli, they frequently suffer from severe semantic misalignment, with salient objects replaced or hallucinated despite high visual quality. In this work, we address this limitation by rethinking the role of explicit semantic interpretation in fMRI decoding. We argue that existing methods rely too heavily on entangled visual embeddings that prioritize low-level appearance cues, such as texture and global gist, over explicit semantic identity. To overcome this, we parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We achieve this by leveraging grounded vision-language models (VLMs) to generate synthetic, human-like, multi-granularity textual representations that capture object identities and spatial organization. Building on this foundation, we propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model. Extensive experiments demonstrate that SynMind outperforms state-of-the-art methods across most quantitative metrics. Notably, by offloading semantic reasoning to our text-alignment module, SynMind surpasses competing methods based on SDXL while using the much smaller Stable Diffusion 1.4 and a single consumer GPU. Large-scale human evaluations further confirm that SynMind produces reconstructions more consistent with human visual perception. Neurovisualization analyses reveal that SynMind engages broader and more semantically relevant brain regions, mitigating over-reliance on high-level visual areas.
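The abstract names a text-alignment module but does not specify it. A common baseline for this kind of mapping, from voxel patterns into a frozen text encoder's embedding space, is multi-output ridge regression; the sketch below shows that baseline purely as an assumption, with random arrays standing in for real paired data and a 768-dimensional target chosen to match a typical CLIP text embedding.

```python
# Sketch of one plausible fMRI -> text-embedding alignment baseline.
# Assumptions: ridge regression as the aligner; shapes are illustrative;
# this is not the paper's actual text-alignment module.
import numpy as np
from sklearn.linear_model import Ridge

# X: fMRI voxel patterns (n_trials, n_voxels);
# Y: text embeddings of the ground-truth captions (n_trials, 768).
X = np.random.randn(1000, 15724).astype(np.float32)
Y = np.random.randn(1000, 768).astype(np.float32)

# Ridge regularization keeps this high-dimensional, low-sample
# mapping well conditioned.
aligner = Ridge(alpha=1e4)
aligner.fit(X, Y)

# At test time, predicted embeddings (or captions decoded from them)
# would serve as the semantic condition for the diffusion model.
pred_embedding = aligner.predict(X[:1])
print(pred_embedding.shape)  # (1, 768)
```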