🤖 AI Summary
This work addresses two persistent challenges in reconstructing videos from fMRI brain activity: inconsistent appearance of salient objects across frames and poor temporal coherence. To this end, the authors propose SemVideo, a framework guided by hierarchical semantics, whose SemMiner module mines three levels of semantic cues from the source videos: static anchor descriptions, motion-oriented narratives, and holistic summaries. Guided by these cues, the framework combines a Semantic Alignment Decoder that maps fMRI signals to CLIP-style embeddings, a Motion Adaptation Decoder built on a tripartite attention fusion architecture, and a Conditional Video Render for the final reconstruction. By jointly optimizing semantic fidelity and temporal consistency, SemVideo achieves state-of-the-art performance on the CC2017 and HCP datasets, markedly improving both the semantic accuracy and temporal smoothness of reconstructed videos and advancing neural decoding and brain-inspired video generation.
📝 Abstract
Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; and (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Building on this guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that conditions the final video reconstruction on the hierarchical semantic guidance. Experiments on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state of the art in fMRI-to-video reconstruction.
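The abstract does not give implementation details, but the three-stage design it describes can be sketched roughly as follows. The component names come from the abstract; the dimensions, the linear alignment head, the stub renderer interface, and the exact form of the tripartite attention fusion (here, one cross-attention branch per semantic level merged by a learned projection) are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch-style sketch of the SemVideo pipeline described above.
# All sizes and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class TripartiteAttentionFusion(nn.Module):
    """Hypothetical fusion: one cross-attention branch per semantic level
    (static anchors, motion narratives, holistic summaries), merged by a
    learned projection."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.merge = nn.Linear(3 * dim, dim)

    def forward(self, fmri_tokens, anchor_emb, motion_emb, summary_emb):
        cues = (anchor_emb, motion_emb, summary_emb)
        fused = [attn(fmri_tokens, cue, cue)[0] for attn, cue in zip(self.branches, cues)]
        return self.merge(torch.cat(fused, dim=-1))


class SemVideoSketch(nn.Module):
    """Illustrative three-stage pipeline: semantic alignment -> motion
    adaptation -> conditional rendering (renderer stubbed out here)."""

    def __init__(self, fmri_dim: int = 4096, dim: int = 512, n_tokens: int = 16):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        # Semantic Alignment Decoder: fMRI vector -> CLIP-like token sequence.
        self.semantic_alignment = nn.Linear(fmri_dim, n_tokens * dim)
        # Motion Adaptation Decoder: fuses the three semantic cue levels.
        self.motion_adaptation = TripartiteAttentionFusion(dim)

    def forward(self, fmri, anchor_emb, motion_emb, summary_emb):
        tokens = self.semantic_alignment(fmri).view(-1, self.n_tokens, self.dim)
        motion_cond = self.motion_adaptation(tokens, anchor_emb, motion_emb, summary_emb)
        # Both outputs would condition a video generator (the Conditional
        # Video Render), which is outside the scope of this sketch.
        return tokens, motion_cond
```

In this reading, the semantic tokens carry object-level appearance cues while the fused motion conditioning carries dynamics, which is one plausible way to target the appearance-mismatch and temporal-coherence problems the abstract identifies.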