🤖 AI Summary
This study addresses the lack of conversational naturalness in videoconferencing by proposing a multimodal approach to predicting sparse but consequential moments of negative subjective experience: low conversational fluidity, low enjoyment, and conversational events such as backchanneling, interruptions, and gaps. Using the RoomReader corpus, the study combines domain-general audio embeddings (Wav2Vec), facial action units (AUs), optical flow, and pose-estimated body motion features to train binary and multiclass classifiers evaluated with ROC-AUC. Key contributions include: (1) applying multimodal analysis to detect rare moments of degraded subjective experience in videoconferencing; and (2) empirical evidence that domain-general audio features contribute most to predicting high-level conversational quality. The best models reach an ROC-AUC of up to 0.87 on held-out videoconference sessions, suggesting that such moments can be flagged for further study or mitigation.
📝 Abstract
Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus and extracted audio embeddings, facial actions, and body motion features to train models that identify low conversational fluidity and low enjoyment and that classify conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on held-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. It also contributes to research on videoconferencing user experience by showing that multimodal machine learning can identify rare moments of negative user experience for further study or mitigation.
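The evaluation setup described above, scoring by ROC-AUC on whole sessions that were held out from training, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `roc_auc` and `split_by_session`, the `session` and `label` keys, and the toy data are all hypothetical, and the feature extraction and classifier are omitted. ROC-AUC is a natural fit for rare events because it measures how well positives are ranked above negatives, independent of how infrequent the positives are.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive clip gets the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def split_by_session(clips, test_fraction=0.25, seed=0):
    """Hold out whole sessions, so that no session contributes clips to
    both the training and the test side (assumed 'session' key per clip)."""
    sessions = sorted({clip["session"] for clip in clips})
    rng = random.Random(seed)
    rng.shuffle(sessions)
    n_test = max(1, round(len(sessions) * test_fraction))
    test_sessions = set(sessions[:n_test])
    train = [c for c in clips if c["session"] not in test_sessions]
    test = [c for c in clips if c["session"] in test_sessions]
    return train, test


# Toy data: 4 hypothetical sessions with 3 clips each.
toy_clips = [{"session": f"S{i // 3}", "label": i % 2} for i in range(12)]
train_clips, test_clips = split_by_session(toy_clips)
```

Splitting by session rather than by clip matters here because clips from the same conversation share speakers and recording conditions; mixing them across train and test would tend to inflate the reported score.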