Semantic-Aware Parsing for Security Logs

📅 2025-06-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Security logs are heterogeneous and largely unstructured, which makes them hard for analysts to query and correlate. Existing AI-based parsers learn syntactic templates but do not interpret what the extracted fields mean, while querying LLMs directly on raw logs is impractical at scale and exposed to prompt injection. This paper proposes Matryoshka, an end-to-end system that uses LLMs to generate semantically aware log parsers. It has two layers: a syntactic parser built on precise regular expressions rather than wildcards, and a semantic layer that clusters variables and maps them into a queryable schema. Matryoshka can additionally map the resulting fields to attributes of the Open Cybersecurity Schema Framework (OCSF) for interoperability. The evaluation uses a newly curated real-world log benchmark with new metrics for how consistently fields are named and mapped across logs. Reported results: (1) the syntactic parser outperforms prior work; (2) the semantic layer reaches an F1 score of 0.95 on realistic security queries; and (3) mapping to the large OCSF taxonomy remains challenging, but automatic field extraction and organization significantly reduce manual parsing effort.

📝 Abstract

Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks. In this paper, we introduce Matryoshka, the first end-to-end system leveraging LLMs to automatically generate semantically aware structured log parsers. Matryoshka combines a novel syntactic parser, which employs precise regular expressions rather than wildcards, with a completely new semantic parsing layer that clusters variables and maps them into a queryable, contextually meaningful schema. This approach provides analysts with queryable and semantically rich data representations, facilitating rapid and precise log querying without the traditional burden of manual parser construction. Additionally, Matryoshka can map the newly created fields to recognized attributes within the Open Cybersecurity Schema Framework (OCSF), enabling interoperability. We evaluate Matryoshka on a newly curated real-world log benchmark, introducing novel metrics to assess how consistently fields are named and mapped across logs. Matryoshka's syntactic parser outperforms prior work, and the semantic layer achieves an F1 score of 0.95 on realistic security queries. Although mapping fields to the extensive OCSF taxonomy remains challenging, Matryoshka significantly reduces manual effort by automatically extracting and organizing valuable fields, moving us closer to fully automated, AI-driven log analytics.
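To make the contrast between wildcard templates and precise regular expressions concrete, here is a minimal sketch. It is not Matryoshka's generated output: the log line, both patterns, and the slot names v1 to v3 are assumptions invented for illustration.

```python
import re

# Hypothetical log line, invented for this example.
LINE = "Accepted password for alice from 10.0.0.5 port 51122 ssh2"

# Wildcard-style template: each variable slot accepts any non-space token.
WILDCARD = re.compile(r"Accepted password for (\S+) from (\S+) port (\S+) ssh2")

# Precise template: each slot is constrained to the shape of its value.
PRECISE = re.compile(
    r"Accepted password for (?P<v1>[a-z_][a-z0-9_-]*) "
    r"from (?P<v2>\d{1,3}(?:\.\d{1,3}){3}) "
    r"port (?P<v3>\d{1,5}) ssh2"
)


def parse(pattern, line):
    """Return the captured variables, or None if the line does not match."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


# A malformed line that a wildcard template still accepts.
BAD_LINE = "Accepted password for alice from not-an-ip port xyz ssh2"
```

Under these assumptions both patterns extract the same three values from LINE, but only the wildcard template accepts BAD_LINE, so the precise pattern avoids assigning a line to a template whose variables it does not actually fit.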

Problem

Research questions and friction points this paper is trying to address.

Heterogeneous and unstructured logs hinder efficient querying and correlation

Existing parsers lack semantic interpretation for effective log querying

Large language models face scalability and security issues with raw logs

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based end-to-end semantic log parsing

Syntactic parser with precise regular expressions

Semantic layer clustering variables into queryable schema
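The value of a semantic layer for querying can be sketched as follows. This is an illustrative toy, not the paper's clustering or mapping method: the template IDs, slot names, and shared field names (user_name, src_ip, url_path) are hypothetical and are not claimed to be OCSF attributes. The mapping table is written by hand here, whereas the paper describes producing it automatically.

```python
# Hypothetical output of a syntactic parser: (template_id, {slot: value}).
PARSED = [
    ("sshd_accept", {"v1": "alice", "v2": "10.0.0.5"}),
    ("nginx_access", {"v1": "10.0.0.5", "v2": "/login"}),
    ("sshd_fail", {"v1": "bob", "v2": "192.168.1.9"}),
]

# Hypothetical semantic mapping: (template_id, slot) -> shared field name.
# Slots with the same meaning in different templates share one name.
FIELD_MAP = {
    ("sshd_accept", "v1"): "user_name",
    ("sshd_accept", "v2"): "src_ip",
    ("nginx_access", "v1"): "src_ip",
    ("nginx_access", "v2"): "url_path",
    ("sshd_fail", "v1"): "user_name",
    ("sshd_fail", "v2"): "src_ip",
}


def to_records(parsed, field_map):
    """Rename anonymous slots to shared field names, one record per log line."""
    records = []
    for template_id, slots in parsed:
        record = {"template": template_id}
        for slot, value in slots.items():
            record[field_map[(template_id, slot)]] = value
        records.append(record)
    return records


def query(records, field, value):
    """Return every record whose named field equals the given value."""
    return [r for r in records if r.get(field) == value]


RECORDS = to_records(PARSED, FIELD_MAP)
HITS = query(RECORDS, "src_ip", "10.0.0.5")
```

With consistent field names, a single query on src_ip returns matching lines from both the SSH and web server logs, even though the address sits in a different slot in each template.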

🔎 Similar Papers

Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging