MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current fMRI-to-video reconstruction methods are limited to single-shot clips, failing to address three key challenges in realistic multi-shot scenarios: (1) inter-shot mixing of fMRI signals, (2) temporal misalignment between low-resolution fMRI and high-frame-rate video causing motion blur, and (3) absence of large-scale, temporally aligned multi-shot fMRI–video datasets. This paper introduces the first divide-and-conquer framework for multi-shot video reconstruction: (1) a novel shot-boundary prediction module enabling spatiotemporal decoupling of fMRI signals; (2) integration of large language model–generated semantic keyframe descriptions to mitigate temporal ambiguity; and (3) construction of the first large-scale synthetic multi-shot fMRI–video dataset. Our method combines CLIP-guided semantic optimization with diffusion-based video generation. Experiments show that signal decomposition improves decoded text–CLIP similarity by 71.8%, and multi-shot reconstruction quality substantially surpasses state-of-the-art methods—establishing a new paradigm for visual cognitive modeling and high-fidelity brain–computer interfaces.
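The reported 71.8% gain is measured as CLIP similarity between decoded and reference captions. The paper does not spell out the computation, but the standard form is cosine similarity between CLIP text embeddings; a minimal sketch (with toy stand-in vectors in place of real CLIP encoder outputs) looks like:

```python
import numpy as np

def clip_text_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two (precomputed) CLIP text embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

# Toy stand-in embeddings; a real pipeline would encode the decoded and
# ground-truth captions with a CLIP text encoder first.
decoded = np.array([0.20, 0.90, 0.10])
reference = np.array([0.25, 0.85, 0.05])
score = clip_text_similarity(decoded, reference)  # close to 1.0 for similar captions
```

Identical embeddings score 1.0; orthogonal ones score 0.0, so higher values indicate the decoded caption better matches the ground truth.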

📝 Abstract
Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing multi-shot videos from fMRI signals
Overcoming fMRI signal mixing across video shots
Addressing temporal resolution mismatch in fMRI-video data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shot boundary predictor for fMRI signal decomposition
LLM-based generative keyframe captioning
Large-scale data synthesis from existing datasets
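The decomposition step above amounts to cutting the mixed fMRI time series at the predicted shot boundaries so each segment can be decoded independently. The paper does not publish code, but assuming boundaries are predicted as frame indices over a (time, voxels) array, the splitting itself is simple:

```python
import numpy as np

def split_by_shot_boundaries(fmri: np.ndarray, boundaries: list[int]) -> list[np.ndarray]:
    """Split a (time, voxels) fMRI array into shot-specific segments.

    `boundaries` are hypothetical predicted cut points (frame indices),
    e.g. the output of a shot boundary predictor module.
    """
    return np.split(fmri, boundaries, axis=0)

# Toy example: 12 fMRI frames x 4 voxels, predicted cuts after frames 5 and 9,
# yielding three shot-specific segments of lengths 5, 4, and 3.
fmri = np.random.randn(12, 4)
segments = split_by_shot_boundaries(fmri, [5, 9])
```

Each resulting segment would then be passed to the captioning stage separately, which is what lets the method sidestep inter-shot signal mixing.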
Authors
Wenwen Zeng, Fudan University
Yonghuang Wu, Fudan University
Yifan Chen, Fudan University
Xuan Xie, Macau University of Science and Technology
Chengqian Zhao, Fudan University
Feiyu Yin, Fudan University
Guoqing Wu, Fudan University
Jinhua Yu, Sun Yat-sen University