🤖 AI Summary
Current fMRI image decoding methods predominantly rely on CLIP’s final semantic layer or parameter-heavy VAE decoders, overlooking the rich object-level representations encoded in CLIP’s intermediate layers and misaligning with the functional hierarchy of the visual cortex. To address this, we propose a parameter-efficient hierarchical alignment framework. Our method introduces, for the first time, a functional-hierarchy-guided multi-layer feature fusion mechanism that maps fMRI signals jointly to CLIP’s intermediate (object-level) and final (semantic-level) layers—eliminating the VAE decoding pathway entirely. Coupled with cross-reconstruction strategies and multi-granularity loss functions, our approach enables end-to-end decoding. It preserves high-level semantic accuracy while substantially improving fine-grained detail fidelity. Our model achieves state-of-the-art (SOTA) performance on semantic evaluation metrics and reduces parameter count by 71.7% compared to VAE-based SOTA methods, striking an optimal balance between efficiency and reconstruction quality.
📝 Abstract
Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook the rich object-level information within CLIP's intermediate layers and conflict with the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to the corresponding intermediate and final CLIP layers, respecting this functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show that BrainMCLIP achieves highly competitive performance, particularly on high-level semantic metrics, where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, by avoiding the VAE pathway it achieves this with substantially fewer parameters, a 71.7% reduction (Table \ref{tab:compare_clip_vae}) compared to top VAE-based SOTA methods. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
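To make the core idea concrete, here is a minimal, hypothetical sketch of the hierarchical alignment described above: voxels from low-level visual areas are projected onto CLIP's intermediate (object-level) token features, voxels from high-level areas onto CLIP's final semantic embedding, and a multi-granularity loss combines a fine-grained term with a semantic term. All names, dimensions (ViT-L/14-style 257 tokens × 768 dims), the random linear projections, and the specific loss terms (MSE + cosine distance) are illustrative assumptions, not the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_projections(n_low_voxels, n_high_voxels, clip_dim=768, n_tokens=257):
    # Random linear maps standing in for the paper's learned fusion modules.
    W_low = 0.01 * rng.standard_normal((n_low_voxels, n_tokens * clip_dim))
    W_high = 0.01 * rng.standard_normal((n_high_voxels, clip_dim))
    return W_low, W_high

def align(low_fmri, high_fmri, W_low, W_high, clip_dim=768, n_tokens=257):
    # Low-level visual-area voxels -> CLIP intermediate (object-level) tokens;
    # high-level visual-area voxels -> CLIP's final semantic embedding.
    inter = (low_fmri @ W_low).reshape(-1, n_tokens, clip_dim)
    final = high_fmri @ W_high
    return inter, final

def multi_granularity_loss(pred_inter, pred_final, tgt_inter, tgt_final):
    # Illustrative stand-in: token-level MSE (fine-grained detail) plus cosine
    # distance on the final embedding (semantics); the paper's exact terms differ.
    fine = np.mean((pred_inter - tgt_inter) ** 2)
    cos = np.sum(pred_final * tgt_final, axis=-1) / (
        np.linalg.norm(pred_final, axis=-1)
        * np.linalg.norm(tgt_final, axis=-1) + 1e-8)
    return fine + np.mean(1.0 - cos)
```

The point of the sketch is the routing, not the modules: because both targets live inside one CLIP encoder, no separate VAE decoder (and its parameters) is needed to recover fine-grained detail.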