🤖 AI Summary
Current fMRI image decoding methods predominantly rely on CLIP’s final semantic layer or parameter-heavy VAE decoders, overlooking the rich object-level representations encoded in CLIP’s intermediate layers and misaligning with the functional hierarchy of the visual cortex. To address this, we propose a parameter-efficient hierarchical alignment framework. Our method introduces, for the first time, a functional-hierarchy-guided multi-layer feature fusion mechanism that maps fMRI signals jointly to CLIP’s intermediate (object-level) and final (semantic-level) layers—eliminating the VAE decoding pathway entirely. Coupled with cross-reconstruction strategies and multi-granularity loss functions, our approach enables end-to-end decoding. It preserves high-level semantic accuracy while substantially improving fine-grained detail fidelity. Our model achieves state-of-the-art (SOTA) performance on semantic evaluation metrics and reduces parameter count by 71.7% compared to VAE-based SOTA methods, striking an optimal balance between efficiency and reconstruction quality.
📝 Abstract
Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook the rich object-level information within CLIP's intermediate layers and conflict with the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to the corresponding intermediate and final CLIP layers, respecting this functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show that BrainMCLIP achieves highly competitive performance, particularly on high-level semantic metrics, where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, by avoiding the VAE pathway it achieves this with substantially fewer parameters, a 71.7% reduction (Table \ref{tab:compare_clip_vae}) compared to top VAE-based SOTA methods. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
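To make the core idea concrete, here is a minimal, hypothetical sketch of the hierarchical alignment described above: voxels from low-level visual areas are projected onto CLIP's intermediate (object-level) token features, voxels from high-level areas onto CLIP's final semantic embedding, and a multi-granularity loss combines a fine-grained term with a semantic term. All names, dimensions (ViT-L/14-style 257 tokens × 768 dims), the random linear projections, and the specific loss terms (MSE + cosine distance) are illustrative assumptions, not the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_projections(n_low_voxels, n_high_voxels, clip_dim=768, n_tokens=257):
    # Random linear maps standing in for the paper's learned fusion modules.
    W_low = 0.01 * rng.standard_normal((n_low_voxels, n_tokens * clip_dim))
    W_high = 0.01 * rng.standard_normal((n_high_voxels, clip_dim))
    return W_low, W_high

def align(low_fmri, high_fmri, W_low, W_high, clip_dim=768, n_tokens=257):
    # Low-level visual-area voxels -> CLIP intermediate (object-level) tokens;
    # high-level visual-area voxels -> CLIP's final semantic embedding.
    inter = (low_fmri @ W_low).reshape(-1, n_tokens, clip_dim)
    final = high_fmri @ W_high
    return inter, final

def multi_granularity_loss(pred_inter, pred_final, tgt_inter, tgt_final):
    # Illustrative stand-in: token-level MSE (fine-grained detail) plus cosine
    # distance on the final embedding (semantics); the paper's exact terms differ.
    fine = np.mean((pred_inter - tgt_inter) ** 2)
    cos = np.sum(pred_final * tgt_final, axis=-1) / (
        np.linalg.norm(pred_final, axis=-1)
        * np.linalg.norm(tgt_final, axis=-1) + 1e-8)
    return fine + np.mean(1.0 - cos)
```

The point of the sketch is the routing, not the modules: because both targets live inside one CLIP encoder, no separate VAE decoder (and its parameters) is needed to recover fine-grained detail.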