🤖 AI Summary
Existing research often examines the textual, visual, and audio modalities of short videos in isolation, failing to uncover how their interplay shapes user engagement. This work proposes the first reproducible and interpretable multimodal analysis framework that integrates automated feature extraction with Shapley-value-based attribution to systematically investigate how multimodal interactions affect view counts in TikTok content related to social anxiety disorder. The study reveals that facial expressions are more predictive of viewership than textual sentiment, that informational content garners more attention than emotional support, and that multimodal synergies exhibit strong threshold-dependent effects, thereby transcending the limitations of conventional unimodal analyses.
📄 Abstract
Short-form video platforms integrate text, visuals, and audio into complex communicative acts, yet existing research analyzes these modalities in isolation, lacking scalable frameworks to interpret their joint contributions. This study introduces a pipeline combining automated multimodal feature extraction with Shapley-value-based interpretability to analyze how text, visuals, and audio jointly influence engagement. Applying this framework to 162,965 TikTok videos and 814,825 images about social anxiety disorder (SAD), we find that facial expressions outperform textual sentiment in predicting viewership, informational content drives more attention than emotional support, and cross-modal synergies exhibit threshold-dependent effects. These findings demonstrate how multimodal analysis reveals interaction patterns invisible to single-modality approaches. Methodologically, we contribute a reproducible framework for interpretable multimodal research applicable across domains; substantively, we advance understanding of mental health communication in algorithmically mediated environments.
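The abstract does not specify implementation details, but the attribution step it describes can be illustrated with a short sketch. The snippet below is a hypothetical, minimal version of such a pipeline, not the authors' released code: it fits a gradient-boosting regressor on synthetic per-video multimodal features (all feature names and the choice of model are illustrative assumptions) and uses the `shap` library's `TreeExplainer` to attribute predicted view counts to each signal.

```python
# Minimal sketch of a Shapley-value attribution pipeline over multimodal
# video features. Synthetic data stands in for the extracted text, visual,
# and audio signals; feature names are invented for illustration.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 1000

# Hypothetical per-video features, one column per modality-level signal.
X = pd.DataFrame({
    "text_sentiment":      rng.normal(0, 1, n),   # from caption/transcript
    "facial_expression":   rng.normal(0, 1, n),   # from video frames
    "audio_arousal":       rng.normal(0, 1, n),   # from soundtrack/voice
    "informational_score": rng.uniform(0, 1, n),  # informational vs. emotional
})

# Synthetic target standing in for log view counts, including a
# threshold-style interaction between visual and informational signals.
y = (0.8 * X["facial_expression"]
     + 0.5 * X["informational_score"]
     + 0.6 * (X["facial_expression"] > 0.5) * X["informational_score"]
     + rng.normal(0, 0.3, n))

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles;
# the mean absolute value per feature summarizes each signal's
# contribution to predicted engagement.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```

In a real analysis, the synthetic columns would be replaced by features extracted from each modality, and pairwise interaction values (`explainer.shap_interaction_values`) could be examined to surface threshold-dependent cross-modal effects of the kind the study reports.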