🤖 AI Summary
This study addresses the problem of predicting user engagement (e.g., the thumbs-up/down ratio and comment sentiment) for long-form YouTube influencer videos and quantifying the relative contributions of the text, audio, and visual modalities. To this end, we propose an interpretable, attention-based multimodal deep learning framework that integrates self-attention with cross-modal feature interaction, and introduce a novel attention-guided ex-post pruning technique to eliminate spurious associations. The prediction results indicate that the text modality contributes most to engagement prediction. The interpretation results show that, within the first 30 seconds of a video, auditory cues (e.g., brand mentions and music) are associated with the sentiment of verbal engagement (comments), whereas visual cues (e.g., images of humans and packaged goods) are associated with the sentiment of non-verbal engagement (the thumbs-up/down ratio). The model shows strong out-of-sample predictive performance, the approach is validated through multiple methods, and the findings are connected to relevant theory to yield implications for influencers, brands, and agencies.
📝 Abstract
Influencer marketing has become a widely used strategy for reaching customers. Despite growing interest among influencers and brand partners in predicting engagement with influencer videos, there has been little research on the relative importance of different video data modalities in predicting engagement. We analyze unstructured data from long-form YouTube influencer videos, spanning text, audio, and video images, using an interpretable deep learning framework that leverages model attention to video elements. This framework enables strong out-of-sample prediction, followed by ex-post interpretation using a novel approach that prunes spurious associations. Our prediction-based results reveal that "what is said" through words (text) is more important than "how it is said" through imagery (video images) or acoustics (audio) in predicting video engagement. Interpretation-based findings show that during the critical onset period of a video (first 30 seconds), auditory stimuli (e.g., brand mentions and music) are associated with sentiment expressed in verbal engagement (comments), while visual stimuli (e.g., video images of humans and packaged goods) are linked with sentiment expressed through non-verbal engagement (the thumbs-up/down ratio). We validate our approach through multiple methods, connect our findings to relevant theory, and discuss implications for influencers, brands, and agencies.