🤖 AI Summary
This study addresses the challenge of high-fidelity, coherent language reconstruction from fMRI signals in naturalistic multimodal cognitive scenarios, a capability critical to the ecological validity and real-world decoding performance of brain–computer interfaces (BCIs). To handle the heterogeneity of neural responses elicited by visual, auditory, and textual stimuli, we propose a modality-adaptive unified decoding framework that integrates a vision-language model (VLM) with modality-specific expert networks, enabling cross-modal alignment and joint representation learning over brain activity and semantic content. The framework models heterogeneous neural inputs jointly while preserving modality-specific characteristics. On multimodal language reconstruction, it achieves performance comparable to state-of-the-art systems while generalizing across input modalities, and its modular design keeps the framework flexible and extensible. This work takes a step toward practical, robust, and adaptable brain–language interfaces.
📝 Abstract
Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain–computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities: visual, auditory, and textual. Our approach leverages vision-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.
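
Below is a minimal sketch of the modality-adaptive idea described above: an fMRI sample is routed through a modality-specific expert that maps it into the token space of a shared language decoder. The module names, dimensions, prefix-token routing, and the use of a small `nn.TransformerDecoder` as a stand-in for a pretrained VLM language decoder are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; architecture details are assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModalityExpert(nn.Module):
    """Maps fMRI features for one stimulus modality into decoder prefix tokens."""

    def __init__(self, n_voxels: int, d_model: int, n_prefix_tokens: int = 8):
        super().__init__()
        self.n_prefix_tokens = n_prefix_tokens
        self.proj = nn.Sequential(
            nn.Linear(n_voxels, d_model * n_prefix_tokens),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, fmri: torch.Tensor) -> torch.Tensor:
        # fmri: (batch, n_voxels) -> prefix tokens: (batch, n_prefix_tokens, d_model)
        x = self.proj(fmri).view(fmri.size(0), self.n_prefix_tokens, -1)
        return self.norm(x)


class UnifiedBrainDecoder(nn.Module):
    """Routes fMRI input through a modality-specific expert, then decodes text
    with a shared decoder (a placeholder for a pretrained VLM language model)."""

    def __init__(self, n_voxels: int, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: ModalityExpert(n_voxels, d_model)
            for m in ("visual", "auditory", "textual")
        })
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, fmri: torch.Tensor, modality: str, tokens: torch.Tensor) -> torch.Tensor:
        # Brain-derived prefix tokens serve as the cross-attention memory.
        prefix = self.experts[modality](fmri)
        tgt = self.token_emb(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, prefix, tgt_mask=mask)
        return self.lm_head(hidden)  # (batch, seq_len, vocab_size) logits


if __name__ == "__main__":
    model = UnifiedBrainDecoder(n_voxels=4096)
    fmri = torch.randn(2, 4096)                 # two fMRI samples
    tokens = torch.randint(0, 32000, (2, 16))   # teacher-forced text tokens
    logits = model(fmri, modality="auditory", tokens=tokens)
    print(logits.shape)  # torch.Size([2, 16, 32000])
```

In this sketch, only the expert branch differs per modality; the token embedding, decoder, and output head are shared, which is one plausible way to realize "joint interpretation across modalities while preserving modality-specific characteristics."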