🤖 AI Summary
This work addresses the lack of semantic annotations in encrypted mobile traffic by proposing T2T, a system that bridges the gap between raw traffic features and user-activity semantics. T2T pairs a traffic feature encoder with a text description decoder, and it obtains training labels automatically by feeding synchronized screen recordings to Qwen-VL-Max, yielding an end-to-end traffic-to-text generation framework. Effective cross-modal training is achieved through a multi-stage loss function. Evaluated on 40,000 real-world samples, the model achieves strong performance, with a BLEU-4 of 58.1, METEOR of 38.3, ROUGE-L of 70.5, and CIDEr of 108.7, demonstrating that the generated descriptions are both semantically accurate and highly readable, approaching the quality of state-of-the-art vision-language models.
📝 Abstract
This paper studies the generation of textual descriptions of user activities and interactions on smartphones. Our approach, which infers activities from encrypted mobile traffic, surpasses traditional smartphone activity classification methods in model scalability and output readability. The paper addresses two obstacles to realizing this idea: the semantic gap between traffic features and smartphone activity captions, and the lack of textually annotated traffic data. To overcome these challenges, we introduce a novel smartphone activity captioning system called T2T (Traffic-to-Text). T2T consists of a flow feature encoder that converts low-level traffic characteristics into meaningful latent features and a caption decoder that yields readable transcripts of smartphone activities. In addition, T2T annotates mobile traffic automatically by feeding synchronized screen-capture videos into the Qwen-VL-Max vision-language model, and it employs multi-stage losses for effective cross-modal training. We evaluate T2T on 40,000 traffic-description pairs collected in two real-world environments, involving 8 smartphone users and 20 mobile apps. T2T achieves a BLEU-4 score of 58.1, a METEOR score of 38.3, a ROUGE-L score of 70.5, and a CIDEr score of 108.7. Quantitative and qualitative analyses show that T2T generates semantically accurate captions comparable to those of the vision-language model.
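To make the pipeline concrete, here is a minimal, hypothetical sketch of the kind of preprocessing a flow feature encoder might begin with: summarizing low-level per-packet records (size, direction, inter-arrival time) into a fixed set of flow-level statistics. The `Packet` record and `flow_features` helper are illustrative assumptions, not T2T's actual (learned) encoder, which maps such low-level characteristics into latent features.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    ts: float       # arrival timestamp in seconds
    size: int       # payload size in bytes
    outbound: bool  # True if sent by the smartphone


def flow_features(packets):
    """Summarize a packet sequence into simple flow-level statistics.

    Hypothetical hand-crafted features for illustration only; T2T's
    encoder learns its latent representation from such low-level inputs.
    """
    out = [p for p in packets if p.outbound]
    inn = [p for p in packets if not p.outbound]
    gaps = [b.ts - a.ts for a, b in zip(packets, packets[1:])]
    return {
        "n_pkts": len(packets),
        "bytes_out": sum(p.size for p in out),
        "bytes_in": sum(p.size for p in inn),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "duration": packets[-1].ts - packets[0].ts if packets else 0.0,
    }


# A tiny example trace: a request followed by a burst of downstream data.
trace = [
    Packet(0.00, 120, True),
    Packet(0.05, 1400, False),
    Packet(0.06, 1400, False),
    Packet(0.30, 80, True),
]
feats = flow_features(trace)
```

Such statistics are computable from encrypted traffic without payload inspection, which is what makes the traffic-to-text framing possible in the first place.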