🤖 AI Summary
This work identifies systematic deficiencies in network foundation models (NFMs): underutilized representation space, poor alignment with domain-expert features, and weak robustness to protocol-level perturbations. To address this, we propose the first intrinsic-representation-oriented evaluation framework spanning three dimensions—geometric analysis (anisotropy quantification), alignment assessment (metric-space consistency), and causal sensitivity testing (contextual disentanglement capability)—and systematically evaluate four state-of-the-art NFMs across five real-world and controllable network datasets. We find pervasive representation degradation across mainstream models, attributable to the decoupling of training objectives from network semantics. Building on these diagnostics, we design a lightweight post-hoc optimization strategy that preserves model architecture integrity while improving the F1-score by up to 0.35. Our results underscore the critical role of representation diagnosis in enabling trustworthy, semantically grounded NFM evolution.
📝 Abstract
This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs), focusing on the analysis of hidden representations rather than downstream task performance alone. Unlike existing efforts, we analyze the models through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature-sensitivity patterns, an inability to disentangle high-level context, and strong payload dependency, among other deficiencies. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (by up to +0.35 $F_1$ score without architectural changes).
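The abstract's Embedding Geometry Analysis centers on anisotropy, i.e., embeddings collapsing into a narrow cone of the representation space. The paper's exact metric is not given here; a common proxy (an assumption, not necessarily the authors' method) is the mean pairwise cosine similarity of embedding vectors, which is near 0 for an isotropic space and near 1 for a degenerate one:

```python
import numpy as np

def anisotropy_score(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity over all distinct embedding pairs.

    Near 0: directions are spread out (isotropic, well-utilized space).
    Near 1: embeddings collapse into a narrow cone (anisotropic).
    """
    # L2-normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = embeddings.shape[0]
    # Average off-diagonal entries only (exclude self-similarity of 1).
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Isotropic baseline: random Gaussian embeddings point in all directions.
iso = rng.standard_normal((500, 64))
# Anisotropic case: same vectors plus a large shared offset (a common
# failure mode where a dominant mean direction swamps the signal).
aniso = iso + 10.0

print(anisotropy_score(iso))    # close to 0
print(anisotropy_score(aniso))  # close to 1
```

Applying this score to each NFM's hidden states across the five datasets would reproduce the kind of geometric diagnostic the framework describes.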