AI Summary
This work addresses the limitations of traditional log compression approaches that follow a "parse-then-compress" paradigm, which decouples structural templates from dynamic variables and overlooks their inherent redundancy, thereby constraining compression efficiency. To overcome this, we propose LogPrism, a novel framework that abandons rigid pre-parsing and instead jointly models log structure and variables. LogPrism introduces a Unified Redundancy Tree (URT) to dynamically integrate structure extraction and variable encoding, effectively capturing co-occurrence patterns between structure and variables to exploit contextual redundancy. Experimental results across 16 benchmark datasets demonstrate that LogPrism achieves state-of-the-art performance: it attains the best compression ratio on 14 datasets, outperforming baselines by 6.12%–83.34%, while sustaining a throughput of 29.87 MB/s, up to 43.04× faster than competitors. In single-archive mode, it further boosts its compression ratio by 273.27%, surpassing the best baseline by 19.39% with a 2.62× speed advantage.
Abstract
In the field of log compression, the prevailing "parse-then-compress" paradigm fundamentally limits effectiveness by treating log parsing and compression as isolated objectives. While parsers prioritize semantic accuracy (i.e., event identification), they often obscure deep correlations between static templates and dynamic variables that are critical for storage efficiency. In this paper, we investigate this misalignment through a comprehensive empirical study and propose LogPrism, a framework that bridges the gap via unified redundancy encoding. Rather than relying on a rigid pre-parsing step, LogPrism dynamically integrates structural extraction with variable encoding by constructing a Unified Redundancy Tree (URT). This hierarchical approach effectively mines "structure+variable" co-occurrence patterns, capturing deep contextual redundancies while accelerating processing through pre-emptive pattern encoding. Extensive experiments on 16 benchmark datasets confirm that LogPrism establishes a new state-of-the-art. It achieves the highest compression ratio on 14 datasets, surpassing existing baselines by margins of 6.12% to 83.34%, while delivering superior throughput at 29.87 MB/s (1.68$\times$ to 43.04$\times$ faster than competitors). Moreover, when configured in single-archive mode to maximize global pattern discovery, LogPrism boosts its compression ratio by 273.27%, outperforming the best baseline by 19.39% with a 2.62$\times$ speed advantage.
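To make the idea of a tree that jointly indexes structure and variables concrete, here is a minimal toy sketch. The paper does not publish the URT's internals, so everything below (class names, the token-trie layout, the prefix-matching helper) is an assumption for illustration only: it shows how placing constant tokens and variable values in one shared hierarchy lets repeated "structure+variable" prefixes be detected and referenced instead of re-stored.

```python
# Illustrative sketch only; NOT LogPrism's actual URT implementation.
# A toy "unified redundancy tree": a token trie in which template tokens
# and variable values share one hierarchy, so co-occurring prefixes
# (structure plus repeated variable fragments) are stored once.

class _Node:
    def __init__(self):
        self.children = {}  # token -> _Node
        self.count = 0      # how often this path has been seen

class UnifiedRedundancyTree:
    def __init__(self):
        self.root = _Node()

    def insert(self, log_line: str) -> None:
        """Add a log line token-by-token, growing shared paths."""
        node = self.root
        for token in log_line.split():
            node = node.children.setdefault(token, _Node())
            node.count += 1

    def shared_prefix_len(self, log_line: str) -> int:
        """Number of leading tokens already present in the tree
        (i.e., tokens a compressor could encode as a back-reference)."""
        node, depth = self.root, 0
        for token in log_line.split():
            if token not in node.children:
                break
            node = node.children[token]
            depth += 1
        return depth

urt = UnifiedRedundancyTree()
urt.insert("conn from 10.0.0.1 closed")
urt.insert("conn from 10.0.0.2 closed")
# Structure ("conn from") and any repeated variable prefix live in the
# same tree, so a new line shares a 2-token prefix with earlier ones.
print(urt.shared_prefix_len("conn from 10.0.0.9 closed"))  # -> 2
```

The point of the toy is the contrast with "parse-then-compress": a parser would split each line into a template and a variable list up front, whereas a single tree lets redundancy that spans the template/variable boundary be discovered directly.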