Unboxing Engagement in YouTube Influencer Videos: An Attention-Based Approach

📅 2020-12-22
📈 Citations: 4
Influential: 0

🤖 AI Summary
This study addresses the problem of predicting user engagement with long-form YouTube influencer videos (sentiment expressed in comments and the thumbs-up/down ratio) and of quantifying the relative importance of the text, audio, and visual modalities. It proposes an interpretable, attention-based deep learning framework, together with a novel ex-post pruning technique that removes spurious associations from the interpretation. In out-of-sample prediction, the text modality is the most important predictor of engagement. During the first 30 seconds of a video, auditory cues (e.g., brand mentions and music) are associated with the sentiment of comments, whereas visual cues (e.g., images of humans and packaged goods) are associated with the thumbs-up/down ratio. The approach is validated through multiple methods, and the findings are connected to relevant theory and to implications for influencers, brands, and agencies.
📝 Abstract
Influencer marketing has become a widely used strategy for reaching customers. Despite growing interest among influencers and brand partners in predicting engagement with influencer videos, there has been little research on the relative importance of different video data modalities in predicting engagement. We analyze unstructured data from long-form YouTube influencer videos (spanning text, audio, and video images) using an interpretable deep learning framework that leverages model attention to video elements. This framework enables strong out-of-sample prediction, followed by ex-post interpretation using a novel approach that prunes spurious associations. Our prediction-based results reveal that "what is said" through words (text) is more important than "how it is said" through imagery (video images) or acoustics (audio) in predicting video engagement. Interpretation-based findings show that during the critical onset period of a video (first 30 seconds), auditory stimuli (e.g., brand mentions and music) are associated with sentiment expressed in verbal engagement (comments), while visual stimuli (e.g., video images of humans and packaged goods) are linked with sentiment expressed through non-verbal engagement (the thumbs-up/down ratio). We validate our approach through multiple methods, connect our findings to relevant theory, and discuss implications for influencers, brands, and agencies.
Problem

Research questions and friction points this paper is trying to address.

Predicting engagement with YouTube influencer videos from multimodal data
Comparing the importance of text, audio, and video images in engagement prediction
Analyzing how auditory and visual stimuli in a video's onset period (first 30 seconds) relate to engagement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable deep learning framework for video analysis
Attention-based multimodal data prediction model
Novel pruning method for spurious association removal
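
The idea of "leveraging model attention to video elements" can be illustrated with a minimal sketch of attention-weighted pooling over video segments. This is not the paper's implementation: the function names, the `query` vector (standing in for learned parameters), and the toy feature values are all assumptions made for illustration. It only shows how attention weights, once computed, can be read back to see which segments the model emphasized.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(segments, query):
    """Pool per-segment feature vectors into a single vector.

    segments: list of equal-length feature vectors, one per video segment
              (hypothetical features, e.g. from text, audio, or images)
    query:    hypothetical learned vector of the same length
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product score between each segment and the query.
    scores = [sum(f * q for f, q in zip(seg, query)) / scale for seg in segments]
    weights = softmax(scores)
    # Weighted sum of the segment vectors, one output value per dimension.
    pooled = [sum(w * seg[i] for w, seg in zip(weights, segments)) for i in range(dim)]
    return pooled, weights

# Toy data: three segments with two-dimensional features (made-up values).
segments = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]

pooled, weights = attention_pool(segments, query)

# The segment with the largest weight is the one the pooling emphasized most.
top_segment = max(range(len(weights)), key=weights.__getitem__)
print("attention weights:", [round(w, 3) for w in weights])
print("most attended segment index:", top_segment)
```

A high attention weight indicates only that the model emphasized a segment, not that the segment causes engagement, which is why the paper pairs attention with an ex-post step that prunes spurious associations. That pruning step is not reproduced here.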