ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling

📅 2025-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing malware family labeling tools suffer from three limitations at scale: incorrect parsing of antivirus (AV) detections, errors in family alias resolution, and an inappropriate strategy for aggregating AV engine results. Together these cause low accuracy and poor scalability, which limits their usefulness for threat analysis and attribution. To address these bottlenecks, the paper presents ClarAVy, a tool that aggregates detections from multiple AV engines with a Variational Bayesian approach. It integrates heterogeneous AV detection outputs, constructs a family alias graph to resolve naming ambiguities, and uses probabilistic aggregation to suppress noise and bias. On the MOTIF and MalPedia datasets, ClarAVy is 8 and 12 percentage points more accurate, respectively, than the prior leading tool, and it was used to label approximately 40 million malicious files.

📝 Abstract

Determining the family to which a malicious file belongs is an essential component of cyberattack investigation, attribution, and remediation. Performing this task manually is time-consuming and requires expert knowledge. Automated tools that label malware using antivirus detections lack accuracy and/or scalability, making them insufficient for real-world applications. Three pervasive shortcomings in these tools are responsible: (1) incorrect parsing of antivirus detections, (2) errors during family alias resolution, and (3) an inappropriate antivirus aggregation strategy. To address each of these, we created our own malware family labeling tool called ClarAVy. ClarAVy utilizes a Variational Bayesian approach to aggregate detections from a collection of antivirus products into accurate family labels. Our tool scales to enormous malware datasets, and we evaluated it by labeling $\approx$40 million malicious files. ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets, respectively.
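To make the three stages named in the abstract concrete, the following is a minimal Python sketch of a generic AV-based labeling pipeline: parse each detection string into candidate family tokens, map aliases to a canonical name, and aggregate across engines. It is not ClarAVy's implementation. The token list, the alias table, the engine names, and the function names are all hypothetical, and the aggregation step is a plain plurality vote used as a simple stand-in, not the Variational Bayesian method the paper describes.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical tokens that describe platform or category, not a family.
GENERIC_TOKENS = {"trojan", "win32", "generic", "malware", "variant", "agent", "heur"}

# Hypothetical alias table: vendor-specific name -> canonical family name.
ALIASES = {"zbot": "zeus", "wannacryptor": "wannacry", "wcry": "wannacry"}


def candidate_families(detection: str) -> Set[str]:
    """Parse one AV detection string into canonical candidate family tokens."""
    tokens = re.split(r"[^a-z0-9]+", detection.lower())
    return {
        ALIASES.get(token, token)  # alias resolution via a flat lookup
        for token in tokens
        if len(token) >= 4 and not token.isdigit() and token not in GENERIC_TOKENS
    }


def plurality_label(detections: Dict[str, str]) -> Optional[str]:
    """Aggregate per-engine detections by plurality vote (a stand-in baseline).

    Each engine contributes at most one vote per candidate family.
    Ties are broken arbitrarily. Returns None if no candidate survives parsing.
    """
    votes: Counter = Counter()
    for detection in detections.values():
        votes.update(candidate_families(detection))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up engine names and detection strings, for illustration only.
    sample = {
        "EngineA": "Trojan.Win32.Zbot.a",
        "EngineB": "Win32/Zeus.Gen",
        "EngineC": "Generic.Malware.12345",
    }
    print(plurality_label(sample))
```

A vote of this kind treats every engine as equally reliable and independent, which is one way an aggregation strategy can go wrong; a probabilistic model that weights engines differently is the kind of alternative the abstract points to.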

Problem

Research questions and friction points this paper is trying to address.

Automated malware family labeling tool

Addresses accuracy and scalability issues

Uses Variational Bayesian approach for aggregation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Bayesian detection aggregation

Scalable malware dataset processing

Improved accuracy in family labeling

🔎 Similar Papers

Explainable Artificial Intelligence (XAI) for Malware Analysis: A Survey of Techniques, Applications, and Open Challenges