🤖 AI Summary
This work addresses a limitation of existing visual decoding approaches: they focus predominantly on high-level semantics and neglect pixel-level details, and so fail to fully capture how the brain encodes visual information. To overcome this, the authors propose a hierarchical alignment strategy that combines multiple pre-trained visual encoders operating at different scales with a contrastive learning objective and a newly designed Fusion Prior mechanism. Together, these components improve cross-modal distributional consistency between neural signals and image representations. The method achieves state-of-the-art performance in both quantitative and qualitative evaluations, significantly improving reconstruction fidelity without compromising retrieval accuracy. Notably, it represents the first effort to successfully balance semantic correctness with fine-grained detail preservation in brain-to-image reconstruction.
📝 Abstract
Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
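The contrastive alignment described above can be illustrated with a minimal sketch. The abstract does not specify the exact form of the loss, so the code below assumes a symmetric, CLIP-style InfoNCE objective applied separately to each visual encoder and then averaged. The encoder names (`semantic`, `low_level`), the temperature of 0.07, and the equal weighting are hypothetical choices made for illustration, not details taken from the paper. The Fusion Prior is not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(brain_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of brain_emb and row i of image_emb form the positive pair;
    every other row in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    b = l2_normalize(brain_emb)
    v = l2_normalize(image_emb)
    logits = b @ v.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_b2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # brain -> image
    loss_v2b = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> brain
    return 0.5 * (loss_b2v + loss_v2b)


def multi_encoder_loss(brain_embs, image_embs, weights=None):
    """Weighted sum of per-encoder contrastive losses.

    Both arguments are dicts keyed by a (hypothetical) encoder name.
    Equal weighting is assumed when no weights are given.
    """
    names = sorted(image_embs)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(
        weights[name] * symmetric_info_nce(brain_embs[name], image_embs[name])
        for name in names
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for embeddings from two encoders with different inductive biases.
    image_embs = {
        "semantic": rng.normal(size=(8, 32)),
        "low_level": rng.normal(size=(8, 64)),
    }
    # Stand-ins for brain features projected into each encoder's space.
    brain_embs = {
        name: emb + 0.1 * rng.normal(size=emb.shape)
        for name, emb in image_embs.items()
    }
    print(multi_encoder_loss(brain_embs, image_embs))
```

In this sketch each encoder gets its own projection of the brain signal, so a high-level encoder and a pixel-oriented encoder can each pull the brain features toward a different level of the visual hierarchy. A learned or tuned weighting across encoders would be a natural variation, but that too is an assumption rather than something the abstract states.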