🤖 AI Summary
This work addresses the challenge of assessing user-perceived video quality (QoE) in end-to-end encrypted video communications, where direct access to raw media content is unavailable. The authors propose qAttCNN, a novel model that, for the first time, integrates self-attention mechanisms with convolutional neural networks to predict two no-reference QoE metrics, BRISQUE and frames per second (FPS), using only packet size information from encrypted traffic. Evaluated on a WhatsApp video call dataset, the model achieves a mean absolute error percentage (MAEP) of 2.14% for BRISQUE and 7.39% for FPS, significantly outperforming existing approaches. This demonstrates that high-accuracy QoE inference is feasible without decrypting traffic or accessing media content, offering a practical solution for quality monitoring in privacy-preserving communication systems.
📝 Abstract
The rapid growth of multimedia consumption, driven by major advances in mobile devices since the mid-2000s, has led to widespread use of video conferencing applications (VCAs) such as Zoom and Google Meet, as well as instant messaging applications (IMAs) such as WhatsApp and Telegram, which increasingly support video conferencing as a core feature. Many of these systems rely on the Web Real-Time Communication (WebRTC) protocol, which enables direct peer-to-peer media streaming without requiring a third-party server to relay data, thereby reducing latency and facilitating real-time communication. Despite WebRTC's potential, adverse network conditions can degrade streaming quality and consequently reduce users' Quality of Experience (QoE). Maintaining high QoE therefore requires continuous monitoring and timely intervention when QoE begins to deteriorate. While content providers can often estimate QoE by directly comparing transmitted and received media, this task is significantly more challenging for internet service providers (ISPs). End-to-end encryption, commonly used by modern VCAs and IMAs, prevents ISPs from accessing the original media stream, leaving only Quality of Service (QoS) and routing information available. To address this limitation, we propose the QoE Attention Convolutional Neural Network (qAttCNN), a model that uses the packet sizes of the traffic to infer two no-reference QoE metrics: BRISQUE and frames per second (FPS). We evaluate qAttCNN on a custom dataset collected from WhatsApp video calls and compare it against existing QoE models. Measured by mean absolute error percentage (MAEP), our approach achieves an error of 2.14% for BRISQUE prediction and 7.39% for FPS prediction.
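The abstract reports errors as MAEP but does not define it. As a rough illustration only, the sketch below assumes MAEP is the mean absolute error expressed as a percentage of the mean ground-truth value; the paper's exact definition may differ (for example, a per-sample percentage error). The function name `maep` and the sample FPS values are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Sequence


def maep(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error as a percentage of the mean true value.

    ASSUMPTION: this normalisation is one plausible reading of "MAEP";
    the abstract does not state the formula.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Plain mean absolute error over all samples.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    mean_true = sum(y_true) / len(y_true)
    if mean_true == 0:
        raise ValueError("mean of y_true is zero; percentage is undefined")
    return 100.0 * mae / abs(mean_true)


# Hypothetical per-window FPS values, for illustration only.
fps_true = [30.0, 25.0, 20.0, 25.0]
fps_pred = [28.0, 26.0, 22.0, 25.0]
fps_error = maep(fps_true, fps_pred)
print(f"MAEP (assumed definition): {fps_error:.2f}%")
```

Under this assumed definition, a lower value means the predicted metric tracks the measured one more closely, which is how the 2.14% and 7.39% figures would be read.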