🤖 AI Summary
Problem: Insider threat detection (ITD) is severely hindered by the scarcity of high-quality, real-world log data: enterprise logs are sensitive and inaccessible, while public datasets are either small and narrow in scope or purely synthetic and semantically unrealistic. Method: We propose Chimera, the first large language model (LLM)-based multi-agent framework for ITD benchmark generation. It employs role-specific agents to model employee behavior and integrates group meetings, pairwise interactions, and autonomous scheduling to simulate both benign activity and 15 distinct types of insider attacks across the technology, finance, and healthcare domains. Contribution/Results: Chimera produces ChimeraLog, a high-fidelity, semantically rich, and interpretable log dataset. Human evaluation and quantitative analysis confirm its diversity and realism. Existing ITD methods achieve an average F1-score of only 0.83 on ChimeraLog, significantly lower than the 0.99 they reach on CERT, demonstrating its greater difficulty and its value as a rigorous benchmark for advancing ITD research.
📝 Abstract
Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible. Publicly available datasets fall short in one of two ways: those limited in scale due to cost lack sufficient real-world coverage, while purely synthetic ones fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across a range of enterprise environments. Chimera models each employee with agents that exhibit role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: a technology company, a finance corporation, and a medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, its realism, and the presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83 on ChimeraLog, significantly lower than the 0.99 achieved on the CERT dataset, demonstrating ChimeraLog's higher difficulty and utility for advancing ITD research.