Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to model the hierarchical organization and temporal dynamics of visual cognition, leaving a significant modality gap between neural responses and visual inputs. To address this, the authors propose NeuroAlign, a novel framework that, for the first time, incorporates the hierarchical structure and temporal dynamics of the biological visual pathway into cross-modal alignment. NeuroAlign achieves fine-grained matching between fMRI signals and video through a two-stage mechanism: it first captures global semantics via Neural-Temporal Contrastive Learning (NTCL), then aligns local patterns using enhanced vector quantization. The framework further introduces a dynamic multimodal fusion module, DynaSyncMM-EMA, along with bidirectional cross-modal prediction. Experiments demonstrate that the proposed method substantially outperforms existing approaches on cross-modal retrieval tasks, offering a new paradigm for understanding the mechanisms underlying visual cognition.
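No public implementation is cited in this record, so the following PyTorch sketch is only one plausible reading of the first stage: a symmetric InfoNCE-style contrastive objective between fMRI and video embeddings, which also reflects the bidirectional cross-modal alignment the summary mentions. The function name, embedding shapes, and temperature value are illustrative assumptions, not the authors' API.

```python
# A minimal sketch of a symmetric InfoNCE-style contrastive loss between
# fMRI and video embeddings; one plausible reading of NTCL, not the
# authors' implementation. Names and the temperature are assumptions.
import torch
import torch.nn.functional as F

def ntcl_contrastive_loss(fmri_emb: torch.Tensor,
                          video_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings.

    fmri_emb, video_emb: (batch, dim) outputs of modality-specific
    encoders; row i of each tensor is assumed to be a positive pair.
    """
    fmri_emb = F.normalize(fmri_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = fmri_emb @ video_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions makes the alignment bidirectional.
    loss_f2v = F.cross_entropy(logits, targets)
    loss_v2f = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_f2v + loss_v2f)
```

Under this reading, matching rows of a batch are positive fMRI-video pairs, and averaging the two cross-entropy terms aligns the modalities in both retrieval directions.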

📝 Abstract
Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal nature of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while our DynaSyncMM-EMA approach enables dynamic multi-modal fusion with adaptive weighting. Experiments demonstrate that NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.
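The abstract names "enhanced vector quantization" and an EMA-based module (DynaSyncMM-EMA) but gives no implementation details. A VQ-VAE-style codebook with exponential-moving-average updates is the standard way EMA appears in vector quantization, so the sketch below shows that baseline mechanism only; the codebook size, decay rate, and class name are assumptions rather than the paper's actual design.

```python
# A standard VQ-VAE-style codebook with exponential-moving-average (EMA)
# updates, shown as one plausible instantiation of the "enhanced vector
# quantization" stage. Codebook size, decay, and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAVectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256,
                 decay: float = 0.99, eps: float = 1e-5):
        super().__init__()
        self.decay, self.eps = decay, eps
        self.register_buffer("codebook", torch.randn(num_codes, dim))
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("ema_embed", self.codebook.clone())

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) local features; snap each one to its nearest code.
        dist = torch.cdist(z, self.codebook)          # (batch, num_codes)
        idx = dist.argmin(dim=-1)
        onehot = F.one_hot(idx, self.codebook.size(0)).type_as(z)
        quantized = self.codebook[idx]
        if self.training:
            with torch.no_grad():
                # EMA updates of per-code usage counts and embedding sums.
                self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
                self.ema_embed.mul_(self.decay).add_(onehot.t() @ z, alpha=1 - self.decay)
                # Laplace smoothing keeps rarely used codes from collapsing.
                n = self.cluster_size.sum()
                size = (self.cluster_size + self.eps) / (n + self.codebook.size(0) * self.eps) * n
                self.codebook.copy_(self.ema_embed / size.unsqueeze(1))
        # Straight-through estimator: gradients bypass the discrete lookup.
        return z + (quantized - z).detach()
```

In this sketch the quantizer would discretize local fMRI or video features into shared codes for fine-grained matching; whether NeuroAlign shares one codebook across modalities, or what its "enhancement" over this baseline is, is not specified in the abstract.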
Problem

Research questions and friction points this paper is trying to address.

cross-modal understanding
neural responses
visual stimuli
hierarchical representation
modality gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

NeuroAlign
hierarchical representation learning
neural-video alignment
temporal contrastive learning
dynamic multimodal fusion