Efficient algorithms for collecting the statistics of large-scale IP address data

📅 2021-08-09

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the time and space bottlenecks in high-volume IP address frequency counting for network traffic measurement, this paper proposes layered, dynamically balanced hashing algorithms. The method has three main components: (1) a layered memory-block structure (each block holds 256 elements of size 256 × 8 bytes on a 64-bit system) coupled with a sparsity-aware hash function that avoids hash collisions; (2) a mechanism that sets the hash index length dynamically according to the range of IP addresses, trading off lookup time against memory use; and (3) a parallel scheme that speeds up statistics collection. Evaluated on synthetic datasets, the approach achieves over 2.3× higher throughput and reduces memory consumption by approximately 40% compared to baseline methods, while retaining the linear time complexity of hash-based solutions. These properties make the approach practical for large-scale IP frequency counting in network monitoring.

📝 Abstract

Compiling the statistics of large-scale IP address data is an essential task in network traffic measurement. The statistical results are used to evaluate the potential impact of user behaviors on network traffic. This requires algorithms that are capable of storing and retrieving a high volume of IP addresses within time and memory constraints. In this paper, we present two efficient algorithms for collecting the statistics of large-scale IP addresses that balance time efficiency and memory consumption. The proposed solutions take into account the sparse nature of the statistics of IP addresses while building the hash function and maintain a dynamic balance among layered memory blocks. There are two layers in the first proposed method, each of which contains a limited number of memory blocks. Each memory block contains 256 elements of size $256 \times 8$ bytes for a 64-bit system. In contrast to built-in hash mapping functions, the proposed solution completely avoids expensive hash collisions while retaining the linear time complexity of hash-based solutions. Moreover, the mechanism dynamically determines the hash index length according to the range of IP addresses, and can balance the time and memory constraints. In addition, we propose an efficient parallel scheme to speed up the collection of statistics. The experimental results on several synthetic datasets show that the proposed method substantially outperforms the baselines with respect to time and memory space efficiency.
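The abstract does not spell out the block layout, so the following is only a minimal sketch of the general idea of collision-free, sparsity-aware counting, not the paper's two-layer algorithm. It swaps in a generic 256-way radix table keyed on IPv4 octets, with blocks allocated lazily so that unused address ranges cost no memory. The class name `SparseIPCounter`, its methods, and the choice of one 8-byte counter per slot (so a leaf block occupies 256 × 8 = 2048 bytes) are all assumptions made for illustration.

```python
from array import array
import ipaddress

BLOCK_SLOTS = 256  # one slot per possible octet value


class SparseIPCounter:
    """Hypothetical sketch: direct-indexed IPv4 counting with lazy blocks."""

    def __init__(self):
        # Inner blocks are lists of 256 child references (None = unallocated).
        self.root = [None] * BLOCK_SLOTS
        self.leaf_blocks = 0  # number of allocated counter blocks

    def add(self, ip: str) -> None:
        a, b, c, d = ipaddress.IPv4Address(ip).packed
        node = self.root
        # Walk the first two octets, allocating inner blocks on demand.
        for octet in (a, b):
            if node[octet] is None:
                node[octet] = [None] * BLOCK_SLOTS
            node = node[octet]
        # The third octet selects a leaf block of 256 unsigned 64-bit counters.
        if node[c] is None:
            node[c] = array("Q", [0] * BLOCK_SLOTS)
            self.leaf_blocks += 1
        # The fourth octet indexes the counter directly, so no two distinct
        # addresses ever share a slot.
        node[c][d] += 1

    def count(self, ip: str) -> int:
        a, b, c, d = ipaddress.IPv4Address(ip).packed
        node = self.root
        for octet in (a, b, c):
            node = node[octet]
            if node is None:
                return 0  # this address range was never seen
        return node[d]
```

Each update and lookup touches a fixed number of blocks, so counting n addresses stays linear in n, and memory grows with the number of distinct /24 ranges observed rather than with the full address space. For example, after calling `add("10.0.0.1")` twice on a fresh counter, `count("10.0.0.1")` returns 2 and only one leaf block has been allocated.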

Problem

Research questions and friction points this paper is trying to address.

Efficient IP statistic collection

Balancing time and memory

Avoiding hash collisions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layered memory block balancing

Collision-free hash function design

Efficient parallel statistics collection
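The page gives no details of the parallel scheme, so the sketch below only illustrates one common pattern for parallel frequency counting: split the input into chunks, count each chunk independently, then merge the partial counts. The function names `count_chunk` and `parallel_count` and the `workers` parameter are hypothetical, and `collections.Counter` stands in for the paper's layered table. In CPython, threads show the structure of the scheme rather than a real speedup for this CPU-bound work; a process pool or a native implementation would be needed for that.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Count address frequencies within one chunk (runs in a worker)."""
    return Counter(chunk)


def parallel_count(addresses, workers=4):
    """Hypothetical partition-and-merge counting over a list of addresses."""
    if not addresses:
        return Counter()
    # Ceiling division so that at most `workers` chunks are produced.
    size = max(1, -(-len(addresses) // workers))
    chunks = [addresses[i:i + size] for i in range(0, len(addresses), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker builds a private partial count; merging happens in
        # the calling thread, so no locking is required.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

Because every worker writes only to its own partial result, the only shared step is the final merge, whose cost depends on the number of distinct addresses per chunk rather than on the total input size.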

🔎 Similar Papers

AutoFlow: An Autoencoder-based Approach for IP Flow Record Compression with Minimal Impact on Traffic Classification

2024-09-17 · arXiv.org · Citations: 0
