🤖 AI Summary
Existing traffic classification and website fingerprinting models generalize poorly under distribution shift; their performance depends heavily on specific datasets and environmental assumptions. Method: We propose the first cross-network evaluation framework explicitly targeting real-world network distribution shift, as opposed to concept drift, leveraging two large-scale, real-world TLS datasets to conduct cross-domain service identification experiments under a future scenario in which the Server Name Indication (SNI) is hidden. Results: Even with abundant labeled data, state-of-the-art models achieve only 30–40% accuracy; remarkably, a simple 1-NN classifier performs comparably, challenging the prevailing consensus that more complex models are superior. Our core contributions are: (i) identifying distribution shift, not concept drift, as the fundamental bottleneck to generalization; (ii) establishing a benchmark framework that isolates and eliminates concept-drift confounds; and (iii) empirically demonstrating that lightweight methods are competitive in robustness, making them attractive for practical deployment.
📝 Abstract
Existing website fingerprinting and traffic classification solutions do not work well when the evaluation context changes, as their performance often relies heavily on context-specific assumptions. To clarify this problem, we take three prior solutions proposed for different but related traffic classification and website fingerprinting tasks, and apply each solution's model to another solution's dataset. We pinpoint the dataset-specific and model-specific properties that lead each solution to overperform in its original evaluation context.
As a realistic evaluation context that accounts for practical labeling constraints, we design an evaluation framework using two recent real-world TLS traffic datasets collected from large-scale networks. The framework simulates a future scenario in which SNIs are hidden in some networks but not in others, and the classifier's goal is to predict the destination services in one network's traffic after being trained on a labeled dataset collected from a different network. Our framework has the distinction of including real-world distribution shift while excluding concept drift. We show that, even when abundant labeled data is available, the best solutions achieve only 30–40% accuracy under distribution shift, and a simple 1-Nearest-Neighbor classifier is not far behind. We report the performance of all models we evaluated, not just the best ones, to give a fair picture of traffic models in practice.
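The cross-network protocol described above (fit on one network's labeled flows, evaluate on a different network's flows) can be sketched with the paper's 1-Nearest-Neighbor baseline. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature dimensions, class count, and the mean offset used to mimic distribution shift between networks are all invented here for demonstration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_network_traffic(n_flows, shift=0.0):
    """Synthetic stand-in for per-flow TLS features (e.g. packet sizes,
    inter-arrival times). `shift` offsets the feature distribution to
    mimic collecting the data on a different network."""
    y = rng.integers(0, 3, size=n_flows)          # 3 hypothetical services
    X = rng.normal(loc=shift, scale=1.0, size=(n_flows, 8))
    X += y[:, None] * 1.5                          # make services separable
    return X, y

# Train on "network A" (SNIs visible, so labels are available) ...
X_train, y_train = make_network_traffic(600, shift=0.0)
# ... evaluate on "network B", whose feature distribution is shifted.
X_test, y_test = make_network_traffic(200, shift=0.5)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"cross-network 1-NN accuracy: {acc:.2f}")
```

In the paper's setting the gap between same-network and cross-network accuracy is the quantity of interest; here the `shift` parameter is only a crude proxy for that real-world distribution shift.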