🤖 AI Summary
To address high false-positive rates and poor interpretability in real-time analysis of heterogeneous network traffic, this paper proposes ReGAIN, a hierarchical Retrieval-Augmented Generation (RAG) framework. It first builds a vector repository of natural-language traffic summaries, then combines metadata-aware filtering, Maximal Marginal Relevance (MMR) sampling, two-stage cross-encoder reranking, and an abstention mechanism so that large language model (LLM) inference is driven by retrieved evidence. The design aims to reduce hallucination by grounding each analytical conclusion in observed traffic segments that the response cites. Evaluated on ICMP ping flood and TCP SYN flood traces from a real-world traffic dataset, the method achieves 95.95%–98.82% accuracy, validated against both dataset ground truth and human expert assessment, and outperforms rule-based, classical machine learning, and deep learning baselines. The core contribution is a hierarchical semantic retrieval pipeline tailored to network traffic, coupled with an LLM reasoning framework whose answers can be verified against the cited evidence.
📝 Abstract
Modern networks generate vast, heterogeneous traffic that must be continuously analyzed for security and performance. Traditional network traffic analysis systems, whether rule-based or machine learning-driven, often suffer from high false-positive rates and lack interpretability, which limits analyst trust. In this paper, we present ReGAIN, a multi-stage framework that combines traffic summarization, retrieval-augmented generation (RAG), and Large Language Model (LLM) reasoning for transparent and accurate network traffic analysis. ReGAIN creates natural-language summaries from network traffic, embeds them into a multi-collection vector database, and uses a hierarchical retrieval pipeline to ground LLM responses with evidence citations. The pipeline features metadata-based filtering, Maximal Marginal Relevance (MMR) sampling, a two-stage cross-encoder reranking mechanism, and an abstention mechanism to reduce hallucinations and ensure grounded reasoning. Evaluated on ICMP ping flood and TCP SYN flood traces from a real-world traffic dataset, ReGAIN achieves accuracy between 95.95% and 98.82% across different attack types and evaluation benchmarks. These results are validated against two complementary sources: dataset ground truth and human expert assessments. ReGAIN also outperforms rule-based, classical ML, and deep learning baselines while adding explainability through responses that can be verified against the cited evidence.
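To make the retrieval stages concrete, the following is a minimal Python sketch of three of them: metadata filtering, greedy MMR selection, and abstention when the evidence is weak. It is an illustration under stated assumptions, not ReGAIN's implementation. The `Summary` schema, the `protocol` metadata field, the `lam` trade-off weight, and the `min_relevance` threshold are all hypothetical, and the two-stage cross-encoder reranking step is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """One traffic summary with its embedding and metadata (hypothetical schema)."""
    text: str
    embedding: list[float]
    protocol: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k=3, lam=0.7):
    """Greedy MMR: trade query relevance against redundancy with earlier picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            relevance = cosine(query, c.embedding)
            redundancy = max(
                (cosine(c.embedding, s.embedding) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def retrieve(query, store, protocol, k=3, min_relevance=0.5):
    """Filter by metadata, diversify with MMR, and abstain on weak evidence."""
    # Stage 1: keep only summaries whose metadata matches the question.
    pool = [s for s in store if s.protocol == protocol]
    # Stage 2: pick k relevant but mutually diverse summaries.
    picked = mmr_select(query, pool, k=k)
    # Stage 3: abstain (return None) if nothing retrieved is relevant enough.
    # The threshold value is an assumption chosen for illustration.
    if not picked:
        return None
    if max(cosine(query, s.embedding) for s in picked) < min_relevance:
        return None
    return picked
```

In this sketch, a caller would pass the returned summaries to the LLM as cited evidence and would answer "insufficient evidence" when `retrieve` returns `None`. Raising `lam` favors relevance, while lowering it favors diversity among the selected summaries.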