Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

📅 2024-05-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work tackles high-fidelity reconstruction of dynamic visual content from low-temporal-resolution fMRI signals, a task hindered by existing methods' inability to jointly model semantic, structural, and motion features, and by hallucination biases introduced by video generation models. The authors propose an fMRI–vision–language tri-modal contrastive learning framework with a sparse causal attention mechanism, enabling multi-frame-consistent motion prediction and disentangled feature representation from a single fMRI frame. To avoid the instability of end-to-end video generation, they further introduce a next-frame-prediction objective and an inflated Stable Diffusion synthesis module. The method achieves state-of-the-art performance across multiple public video–fMRI benchmark datasets. Visualization and ablation analyses confirm strong neuroscientific interpretability, with clear improvements in semantic accuracy, structural fidelity, and temporal consistency.

📝 Abstract
Reconstructing human dynamic vision from brain activity is a challenging task of great scientific significance. Although prior video reconstruction methods have made substantial progress, they still suffer from several limitations: (1) difficulty in simultaneously reconciling semantic (e.g., categorical descriptions), structural (e.g., size and color), and consistent motion information (e.g., frame order); (2) the low temporal resolution of fMRI, which poses a challenge for decoding multiple video frames from a single fMRI frame; (3) reliance on video generation models, which introduces ambiguity as to whether the dynamics observed in the reconstructed videos are genuinely derived from fMRI data or are hallucinations of the generative model. To overcome these limitations, we propose a two-stage model named Mind-Animator. In the fMRI-to-feature stage, we decouple semantic, structural, and motion features from fMRI. Specifically, we employ fMRI–vision–language tri-modal contrastive learning to decode semantic features from fMRI, and design a sparse causal attention mechanism for decoding multi-frame video motion features through a next-frame-prediction task. In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion model, effectively eliminating interference from external video data. Extensive experiments on multiple video–fMRI datasets demonstrate that our model achieves state-of-the-art performance. Comprehensive visualization analyses further elucidate the interpretability of our model from a neurobiological perspective. Project page: https://mind-animator-design.github.io/.
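The tri-modal contrastive learning described in the abstract can be illustrated with a minimal numpy sketch: a symmetric InfoNCE loss pulls each fMRI embedding toward its paired video and caption embeddings. Function names, the temperature value, and the two-term loss composition are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalize rows
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                    # (N, N) cosine similarities
    idx = np.arange(len(a))

    def ce(l):
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def trimodal_loss(fmri_emb, vision_emb, text_emb):
    """Sketch of tri-modal alignment: fMRI vs. vision plus fMRI vs. text.
    (Hypothetical composition; the paper may weight or structure terms differently.)"""
    return info_nce(fmri_emb, vision_emb) + info_nce(fmri_emb, text_emb)

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 64))
# near-identical embeddings should align far better than random ones
loss_aligned = trimodal_loss(f, f + 0.01 * rng.normal(size=f.shape),
                                f + 0.01 * rng.normal(size=f.shape))
loss_random = trimodal_loss(f, rng.normal(size=f.shape), rng.normal(size=f.shape))
print(loss_aligned, loss_random)
```

In practice the vision and text embeddings would come from a frozen encoder such as CLIP, with only the fMRI encoder trained against this loss.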
Problem

Research questions and friction points this paper is trying to address.

Reconstruct dynamic vision from slow brain activity
Decouple semantic, structure, and motion features
Enhance interpretability of video reconstruction models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples semantic, structure, motion features
Uses fMRI-vision-language tri-modal contrastive learning
Integrates features with inflated Stable Diffusion
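The "sparse causal attention" listed above is, in common video-diffusion practice (e.g., the Tune-A-Video convention), a mask letting each frame attend only to itself, the first frame, and the immediately preceding frame; the paper's exact pattern may differ, so treat this numpy sketch as an assumption.

```python
import numpy as np

def sparse_causal_mask(n_frames):
    """Boolean attention mask for frame-to-frame attention.
    Frame t may attend to: itself, frame 0 (anchor), and frame t-1 (motion).
    All future frames are masked out (causal)."""
    m = np.eye(n_frames, dtype=bool)   # self-attention
    m[:, 0] = True                     # every frame sees the first frame
    for t in range(1, n_frames):
        m[t, t - 1] = True             # and its immediate predecessor
    return m

mask = sparse_causal_mask(5)
print(mask.astype(int))
```

Compared with full causal attention, this keeps the attention cost per frame constant while still propagating appearance (via frame 0) and motion (via frame t-1), which is why it suits next-frame-prediction objectives.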
Yizhuo Lu
Institute of Automation, Chinese Academy of Sciences
Artificial intelligence, neural encoding and decoding
Changde Du
Institute of Automation, Chinese Academy of Sciences
Machine learning, computer vision, computational neuroscience, brain-computer interface (BCI), artificial intelligence
Chong Wang
School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
Xuanliu Zhu
Beijing University of Posts and Telecommunications, Beijing, China
Liuyun Jiang
Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China, School of Future Technology, University of Chinese Academy of Sciences
Huiguang He
Institute of Automation, Chinese Academy of Sciences
Artificial intelligence, medical image processing, brain-computer interface