BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current fMRI image decoding methods predominantly rely on CLIP’s final semantic layer or parameter-heavy VAE decoders, overlooking the rich object-level representations encoded in CLIP’s intermediate layers and misaligning with the functional hierarchy of the visual cortex. To address this, we propose a parameter-efficient hierarchical alignment framework. Our method introduces, for the first time, a functional-hierarchy-guided multi-layer feature fusion mechanism that maps fMRI signals jointly to CLIP’s intermediate (object-level) and final (semantic-level) layers—eliminating the VAE decoding pathway entirely. Coupled with cross-reconstruction strategies and multi-granularity loss functions, our approach enables end-to-end decoding. It preserves high-level semantic accuracy while substantially improving fine-grained detail fidelity. Our model achieves state-of-the-art (SOTA) performance on semantic evaluation metrics and reduces parameter count by 71.7% compared to VAE-based SOTA methods, striking an optimal balance between efficiency and reconstruction quality.

📝 Abstract
Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook the rich object information within CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to the corresponding intermediate and final CLIP layers, respecting this functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show that BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics, where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a 71.7% reduction (Table \ref{tab:compare_clip_vae} in the paper) compared to top VAE-based SOTA methods by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
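The core idea of the abstract, routing functionally distinct brain regions to different depths of CLIP, can be sketched as two parallel linear heads. This is an illustrative toy, not the authors' code: voxel counts, embedding width, and the use of plain linear maps are all placeholder assumptions.

```python
import numpy as np

# Hedged sketch of hierarchy-guided decoding: low-level visual-cortex voxels
# are mapped to an intermediate (object-level) CLIP feature, high-level voxels
# to the final (semantic) CLIP embedding. All sizes are hypothetical.
rng = np.random.default_rng(0)

N_LOW, N_HIGH = 4000, 6000  # voxel counts per functional partition (made up)
D_CLIP = 768                # CLIP embedding width (e.g. ViT-B); assumption

# One projection per pathway, mirroring the two-target alignment in the paper.
W_inter = rng.normal(0, 0.01, size=(N_LOW, D_CLIP))
W_final = rng.normal(0, 0.01, size=(N_HIGH, D_CLIP))

def decode(fmri_low, fmri_high):
    """Map the two fMRI partitions to their respective CLIP-layer targets."""
    z_inter = fmri_low @ W_inter    # estimate of an intermediate-layer feature
    z_final = fmri_high @ W_final   # estimate of the final semantic embedding
    return z_inter, z_final

z_i, z_f = decode(rng.normal(size=(8, N_LOW)), rng.normal(size=(8, N_HIGH)))
print(z_i.shape, z_f.shape)  # (8, 768) (8, 768)
```

In practice the paper's mapping network and fusion mechanism are richer than two linear layers; the sketch only shows how the functional split of the input drives which CLIP layer each pathway is trained against.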
Problem

Research questions and friction points this paper is trying to address.

Aligns fMRI signals to CLIP's intermediate and final layers
Eliminates need for parameter-heavy VAE pipelines in brain decoding
Captures visual details missed by CLIP-only approaches efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer CLIP fusion guided by brain hierarchy
Cross-reconstruction strategy with multi-granularity loss
Parameter-efficient approach eliminating VAE pipeline
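The multi-granularity loss mentioned above can be pictured as a weighted sum of alignment losses at the two CLIP depths. The exact loss terms and weights in the paper are not given here, so this is a minimal sketch assuming cosine alignment and equal placeholder weights.

```python
import numpy as np

def cosine_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between predicted and target features."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def multi_granularity_loss(z_inter, t_inter, z_final, t_final,
                           w_inter=0.5, w_final=0.5):
    # Weighted sum over the object-level and semantic-level granularities;
    # the weights are illustrative placeholders, not the paper's values.
    return (w_inter * cosine_loss(z_inter, t_inter)
            + w_final * cosine_loss(z_final, t_final))

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 768))
# Perfect intermediate match (loss 0) plus perfect final mismatch (loss 2):
loss = multi_granularity_loss(z, z, z, -z)
print(loss)  # 0.5*0 + 0.5*2 = 1.0
```

Supervising both granularities at once is what lets the model keep fine detail (intermediate term) without sacrificing semantic accuracy (final term).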
Authors
Tian Xia — Northwest University, Xi'an, China
Zihan Ma — Xi'an Jiaotong University
Xinlong Wang — Northwest University, Xi'an, China
Qing Liu — Northwest University, Xi'an, China
Xiaowei He — Northwest University, Xi'an, China
Tianming Liu — Distinguished Research Professor of Computer Science, University of Georgia
Yudan Ren — Northwest University, Xi'an, China