AI Summary
This work addresses a core challenge in log anomaly detection: sparse training data fails to cover the large space of legitimate execution paths, so models misclassify previously unseen yet normal log sequences as anomalies. To mitigate this limitation, the authors propose a source code-driven log data augmentation approach that integrates log-oriented control flow graphs (LCFGs) with chain-of-thought (CoT) reasoning from large language models. By combining static program analysis with domain-specific heuristics, the method generates log sequences that are structurally valid, logically consistent, and labeled. Experiments on the HDFS and ZooKeeper benchmarks across 12 anomaly detection models show consistent improvements in generalization: deep learning models gain 2.18% and 1.69% in average F1 score, respectively, and one unsupervised Transformer rises from 0.818 to 0.970 F1 on HDFS.
Abstract
Log-based anomaly detection is fundamentally constrained by training data sparsity. Our empirical study reveals that public benchmark datasets cover less than 10% of source code log templates. Consequently, models frequently misclassify unseen but valid execution paths as anomalies, leading to false alarms. To address this, we propose AnomalyGen, a novel framework that augments training data by synthesizing labeled log sequences from source code. AnomalyGen combines log-oriented static analysis with Large Language Model (LLM) reasoning in three stages: (1) building Log-Oriented Control Flow Graphs (LCFGs) to enumerate structurally valid execution paths; (2) applying LLM Chain-of-Thought (CoT) reasoning to verify logical consistency and generate realistic runtime parameters (e.g., block IDs, IP addresses); and (3) labeling generated sequences with domain heuristics. Evaluations on HDFS and ZooKeeper across 12 diverse anomaly detection models show that AnomalyGen consistently improves performance. Deep learning models achieve average F1-score gains of 2.18% (HDFS) and 1.69% (ZooKeeper), with an unsupervised Transformer on HDFS rising from 0.818 to 0.970. Ablation results show that both static analysis and LLM-based verification are necessary: removing them reduces F1 by up to 8.7 and 10.7 percentage points, respectively. Our framework and datasets are publicly available to facilitate future research.
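To make stage (1) concrete, the following is a minimal sketch of enumerating execution paths over a log-oriented control flow graph and reading off the log template sequence each path would emit. It is not AnomalyGen's implementation: the graph, the node names, the template strings, and the function `enumerate_log_sequences` are all hypothetical, and a plain depth-first search over acyclic paths stands in for whatever traversal the framework actually uses.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy LCFG: each node maps to its successor nodes.
EDGES: Dict[str, List[str]] = {
    "entry": ["check"],
    "check": ["write", "fail"],
    "write": ["exit"],
    "fail": ["exit"],
    "exit": [],
}

# Illustrative log template per node (None means the node logs nothing).
TEMPLATES: Dict[str, Optional[str]] = {
    "entry": "Receiving block <*>",
    "check": None,
    "write": "Received block <*> of size <*>",
    "fail": "Exception writing block <*>",
    "exit": None,
}


def enumerate_log_sequences(
    edges: Dict[str, List[str]],
    templates: Dict[str, Optional[str]],
    start: str = "entry",
    end: str = "exit",
) -> List[List[str]]:
    """Return the log template sequence of every acyclic start-to-end path."""
    sequences: List[List[str]] = []

    def dfs(node: str, visited: Set[str], logs: List[str]) -> None:
        template = templates.get(node)
        if template is not None:
            logs = logs + [template]  # copy so sibling branches stay independent
        if node == end:
            sequences.append(logs)
            return
        for successor in edges.get(node, []):
            if successor not in visited:  # skip back edges to keep paths finite
                dfs(successor, visited | {successor}, logs)

    dfs(start, {start}, [])
    return sequences


if __name__ == "__main__":
    for sequence in enumerate_log_sequences(EDGES, TEMPLATES):
        print(sequence)
```

In this toy graph the search yields two candidate sequences, one through the `write` branch and one through the `fail` branch. Under the pipeline described above, such candidates would still need the LLM consistency check and parameter filling of stage (2) and the heuristic labeling of stage (3) before being used as training data.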