One task to rule them all: A closer look at traffic classification generalizability

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing traffic classification and website fingerprinting models generalize poorly under distribution shift: their performance depends heavily on dataset-specific and environment-specific assumptions. Method: the authors propose the first cross-network evaluation framework that explicitly targets real-world network distribution shift (not concept drift), using two large-scale, real-world TLS datasets to run cross-domain service identification experiments under a future scenario in which Server Name Indication (SNI) is hidden. Results: even with abundant labeled data, state-of-the-art models reach only 30–40% accuracy, and a simple 1-NN classifier performs comparably, challenging the prevailing consensus on the superiority of complex models. Core contributions: (i) identifying distribution shift, not concept drift, as the fundamental bottleneck to generalization; (ii) a benchmark framework that isolates distribution shift by excluding concept-drift confounds; and (iii) empirical evidence that lightweight methods offer comparable robustness, making them attractive for practical deployment.

📝 Abstract
Existing website fingerprinting and traffic classification solutions do not work well when the evaluation context changes, as their performance often relies heavily on context-specific assumptions. To clarify this problem, we take three prior solutions presented for different but similar traffic classification and website fingerprinting tasks, and apply each solution's model to another solution's dataset. We pinpoint the dataset-specific and model-specific properties that lead each of them to overperform in its specific evaluation context. As a realistic evaluation context that takes practical labeling constraints into account, we design an evaluation framework using two recent real-world TLS traffic datasets from large-scale networks. The framework simulates a futuristic scenario in which SNIs are hidden in some networks but not in others, and the classifier's goal is to predict destination services in one network's traffic, having been trained on a labeled dataset collected from a different network. Our framework has the distinction of including real-world distribution shift while excluding concept drift. We show that, even when abundant labeled data is available, the best solutions' accuracy under distribution shift is between 30% and 40%, and a simple 1-Nearest Neighbor classifier is not far behind. We report the performance of all models evaluated, not just the best ones, for a fair representation of traffic models in practice.
Problem

Research questions and friction points this paper is trying to address.

Evaluate traffic classification generalizability across different contexts
Identify dataset-specific and model-specific performance limitations
Assess classifier performance under real-world distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates traffic classification models across different datasets
Simulates a scenario in which SNIs are hidden, using real-world TLS traffic
Benchmarks a simple 1-Nearest Neighbor classifier against state-of-the-art models
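The cross-network setup described above can be sketched as follows: train a classifier on flow features labeled in one network (where SNIs are visible), then evaluate it on traffic from a different network (where SNIs are hidden). This is a minimal illustrative sketch, not the paper's actual pipeline; the features, the synthetic data, and the shift model are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of cross-network evaluation with a 1-NN baseline.
# Synthetic "flow features" stand in for real TLS traffic features;
# `shift` crudely models network-specific distribution shift.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_network(n, shift):
    # Toy 4-dimensional flow features (e.g. packet sizes / timings).
    X = rng.normal(loc=shift, scale=1.0, size=(n, 4))
    # Toy "destination service" label derived from the features.
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

X_a, y_a = make_network(500, shift=0.0)  # network A: labeled training data
X_b, y_b = make_network(500, shift=1.5)  # network B: shifted test data

clf = KNeighborsClassifier(n_neighbors=1).fit(X_a, y_a)
in_domain = clf.score(X_a, y_a)      # optimistic same-network score
cross_domain = clf.score(X_b, y_b)   # realistic cross-network score
print(f"in-domain: {in_domain:.2f}, cross-domain: {cross_domain:.2f}")
```

Because 1-NN memorizes its training set, the in-domain score is perfect, while the cross-domain score degrades under the injected shift, mirroring (in caricature) the gap between in-context and cross-network evaluation that the paper measures.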