AI Summary
Existing log anomaly detection methods struggle to achieve message-level precision and face challenges such as the coexistence of normal and anomalous instances within the same log template, subsystem heterogeneity, and high annotation costs. This work proposes a label-efficient Mixture-of-Experts (MoE) framework that, for the first time, incorporates failure-domain awareness into the MoE architecture. Requiring only a small number of labels per template, the approach uses a single offline invocation of a large language model (LLM) to construct failure-domain partitions, then combines a lightweight routing network with domain-specific expert models to enable online message-level anomaly detection and failure-domain identification. Evaluated on the BGL dataset with K=100, the method achieves an F1 score of 98.16, reduces annotation effort by 76×, and detects 86.3% of anomalies with previously unseen EventIDs; on Thunderbird, it attains an F1 score of 99.95 with perfect recall, substantially improving generalization to unknown events.
Abstract
Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity but remains challenging: a single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once, offline. We label at most K lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76×, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.
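The data flow described in the abstract can be illustrated with a minimal sketch. It is not FAME's implementation: the function names `select_for_annotation` and `detect`, the keyword-based `toy_router` and `toy_experts`, the domain names, and the sample log lines are all hypothetical stand-ins. In the paper the router and experts are trained models; here they are plain functions so the example stays self-contained. The sketch shows only two ideas: capping annotation at K lines per template, and routing each message to a domain expert that returns both an anomaly decision and a failure-domain label.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def select_for_annotation(
    lines: Iterable[Tuple[str, str]], k: int
) -> Dict[str, List[str]]:
    """Keep at most k messages per template ID, in arrival order."""
    budget: Dict[str, List[str]] = defaultdict(list)
    for template_id, message in lines:
        if len(budget[template_id]) < k:
            budget[template_id].append(message)
    return dict(budget)


def detect(
    message: str,
    router: Callable[[str], str],
    experts: Dict[str, Callable[[str], bool]],
) -> Tuple[bool, str]:
    """Route one message to a failure domain, then ask that domain's expert."""
    domain = router(message)
    is_anomaly = experts[domain](message)
    return is_anomaly, domain


# Hypothetical stand-ins for the trained router and experts (assumption:
# keyword rules replace learned models purely for illustration).
def toy_router(message: str) -> str:
    return "memory" if "ecc" in message.lower() else "network"


toy_experts: Dict[str, Callable[[str], bool]] = {
    "memory": lambda m: "uncorrectable" in m.lower(),
    "network": lambda m: "timeout" in m.lower(),
}

# Invented log lines: template E1 covers both normal and anomalous messages.
logs = [
    ("E1", "ECC error corrected at 0x1f"),
    ("E1", "ECC error uncorrectable at 0x2a"),
    ("E1", "ECC error corrected at 0x3b"),
    ("E2", "link timeout on port 3"),
]

picked = select_for_annotation(logs, k=2)
result = detect("ECC error uncorrectable at 0x2a", toy_router, toy_experts)
print(picked)
print(result)
```

Template `E1` in the sample data carries both a normal and an anomalous message, which is the situation that makes template-level detection insufficient and motivates a per-message decision.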