🤖 AI Summary
This study addresses the lack of region-specific alignment in social cognition modeling by extending brain-tuning to the multimodal (audio-visual) domain, focusing on the superior temporal sulcus (STS), a core hub for social processing. Methodologically, we fine-tune a deep neural model that learns joint audio-visual representations using fMRI responses recorded while participants watched *Friends*, targeting fine-grained activity prediction within the STS and its anatomically adjacent regions. Our contributions are threefold: (1) introducing the first multimodal brain-tuning paradigm; (2) achieving significantly improved fMRI activity prediction within the STS; and (3) demonstrating substantial gains on a real-world social semantic task, sarcasm detection in sitcoms, thereby validating the efficacy and generalizability of targeted, region-specific brain-tuning for modeling high-level social cognition.
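The summary does not include implementation details, but the brain-tuning setup it describes can be sketched as fitting a voxel-wise readout on top of the joint audio-visual encoder and backpropagating a regression loss on recorded STS responses into the encoder itself. The sketch below is a minimal illustration under assumed names: `VoxelReadout`, `brain_tuning_step`, and the plain MSE objective are illustrative choices, not the authors' confirmed implementation.

```python
import torch
import torch.nn as nn

class VoxelReadout(nn.Module):
    """Linear readout mapping joint audio-visual embeddings to STS voxel responses.
    (Hypothetical module for illustration; the paper's head may differ.)"""
    def __init__(self, embed_dim: int, n_voxels: int):
        super().__init__()
        self.linear = nn.Linear(embed_dim, n_voxels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.linear(features)

def brain_tuning_step(encoder, readout, optimizer, av_clip, fmri_sts):
    """One brain-tuning step: predict STS fMRI responses for a paired
    audio-visual clip, then backpropagate the regression loss into the
    encoder so its representations shift toward the targeted region."""
    optimizer.zero_grad()
    features = encoder(av_clip)        # (batch, embed_dim) joint A/V embedding
    pred = readout(features)           # (batch, n_voxels) predicted responses
    loss = nn.functional.mse_loss(pred, fmri_sts)  # assumed loss; paper may use another
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the optimizer updates both the readout and the encoder, the model is "tuned to" the region rather than merely probed by it; freezing the encoder would instead reduce this to a standard encoding-model fit.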
📝 Abstract
Recent studies on audio models show that brain-tuning (fine-tuning models to better predict corresponding fMRI activity) improves brain alignment and increases performance on downstream semantic and audio tasks. We extend this approach to a multimodal audio-video model to enhance social cognition, targeting the superior temporal sulcus (STS), a key region for social processing, using fMRI responses recorded while subjects watched *Friends*. We find significant increases in brain alignment within the STS and an adjacent ROI, as well as improvements on a social cognition task related to the training data: sarcasm detection in sitcoms. In summary, our study extends brain-tuning to the multimodal domain, demonstrating gains on a downstream task after tuning to a relevant functional region.
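The abstract reports "brain alignment" without defining how it is scored. In this literature, alignment is typically measured by fitting a ridge encoding model from frozen model features to voxel responses and scoring held-out predictions per voxel; the snippet below sketches that convention and is an assumption here, not a detail stated in the abstract.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from scipy.stats import pearsonr

def brain_alignment(features_train, fmri_train, features_test, fmri_test):
    """Fit a ridge encoding model from model features to voxel responses,
    then score alignment as the mean per-voxel Pearson r on held-out data.
    (Standard convention in brain-alignment work; assumed, not confirmed.)"""
    model = RidgeCV(alphas=np.logspace(-2, 5, 8))  # alpha grid is an assumption
    model.fit(features_train, fmri_train)           # multi-target ridge fit
    pred = model.predict(features_test)             # (n_timepoints, n_voxels)
    voxel_r = [pearsonr(pred[:, v], fmri_test[:, v])[0]
               for v in range(fmri_test.shape[1])]
    return float(np.mean(voxel_r))
```

Under this reading, the paper's claim is that features from the brain-tuned model yield higher held-out correlations in the STS (and one adjacent ROI) than features from the untuned model.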