Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing self-supervised encrypted traffic classification methods: by flattening traffic into byte sequences, they disrupt protocol semantics and so fail to meaningfully reduce reliance on labeled data. To overcome this, the authors propose a protocol-native tabular pretraining paradigm that uses the field-level semantics defined by network protocols as architectural priors, reconstructing traffic into a tabular modality. The core innovations are Flow Semantic Units (FSUs) as fundamental building blocks, field-specific embeddings, a learnability-guided filtering mechanism, and dual-axis attention, culminating in a novel FSU-based tabular masked autoencoder (FlowSem-MAE). Experiments demonstrate that the proposed method surpasses most fully supervised models using only half the labeled data and achieves state-of-the-art performance across multiple benchmarks.
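The paper does not detail its learnability-guided filtering here, but the underlying idea (drop fields that are effectively random, such as `ip.id`, so they are never used as reconstruction targets) can be sketched with a simple empirical-entropy test. All names, thresholds, and data below are illustrative assumptions, not the paper's actual mechanism:

```python
# Hypothetical sketch of learnability-guided field filtering: fields whose
# empirical entropy is near the maximum possible for their value set behave
# like random noise (e.g. ip.id) and are dropped as pretraining targets.
import math
from collections import Counter

def field_entropy(values):
    # Shannon entropy (bits) of the empirical value distribution
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def filter_learnable_fields(flows, max_entropy_ratio=0.9):
    # Keep fields whose entropy is well below the maximum achievable
    # for their observed alphabet; near-maximal entropy ~ unlearnable.
    kept = {}
    for field, values in flows.items():
        distinct = len(set(values))
        if distinct == 1:           # constant field: trivially predictable
            kept[field] = values
            continue
        if field_entropy(values) < max_entropy_ratio * math.log2(distinct):
            kept[field] = values
    return kept

# Toy flow table: header-field name -> per-packet values (illustrative)
flows = {
    "ip.ttl":    [64, 64, 64, 128, 64, 64, 64, 64],     # skewed: learnable
    "ip.id":     [0x1A2B, 0x9C4D, 0x33F1, 0x7E02,       # near-uniform: dropped
                  0x5B68, 0xC0DE, 0x0F11, 0x8A90],
    "tcp.flags": [0x18, 0x18, 0x18, 0x18, 0x18, 0x18, 0x10, 0x18],
}
print(sorted(filter_learnable_fields(flows)))  # → ['ip.ttl', 'tcp.flags']
```

An entropy threshold is only one plausible learnability proxy; the paper's actual criterion may differ (e.g. measured reconstruction loss during pretraining).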

📝 Abstract
Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen-encoder evaluation, accuracy drops from greater than 0.9 to less than 0.47. We argue the root cause is an inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art methods across datasets. With only half the labeled data, it outperforms most existing methods trained on the full data.
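The dual-axis attention described above alternates attention over the two natural axes of tabular traffic: across fields within a packet (intra-packet), then across packets for each field (temporal). A minimal NumPy sketch of this factorized pattern, with illustrative shapes and no learned projections (the paper's actual layer presumably uses standard multi-head attention with trained weights):

```python
# Minimal sketch of dual-axis attention over a (packets, fields, dim) tensor:
# axis 1 = intra-packet attention across FSUs, axis 0 = temporal attention
# across packets. Shapes and the absence of learned Q/K/V projections are
# simplifying assumptions for illustration only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # Scaled dot-product self-attention along the second-to-last axis of x
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def dual_axis_attention(flow):
    # flow: (packets, fields, dim)
    intra = attend(flow)                     # mix FSUs within each packet
    temporal = attend(intra.swapaxes(0, 1))  # mix packets for each FSU slot
    return temporal.swapaxes(0, 1)           # back to (packets, fields, dim)

rng = np.random.default_rng(0)
flow = rng.normal(size=(8, 12, 16))  # 8 packets, 12 FSUs, 16-dim embeddings
out = dual_axis_attention(flow)
print(out.shape)  # → (8, 12, 16)
```

Factorizing attention this way costs O(P·F²) + O(F·P²) instead of O((P·F)²) for full attention over all packet-field pairs, which is the usual motivation for axis-wise attention on grid-shaped inputs.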
Problem

Research questions and friction points this paper is trying to address.

encrypted traffic classification
protocol semantics
masked modeling
tabular modality
inductive bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

protocol-native
tabular pretraining
encrypted traffic classification
Flow Semantic Units
masked autoencoder