🤖 AI Summary
This study addresses critical limitations in the YARA rule ecosystem (ad hoc sharing practices, opaque quality, stale maintenance, and suboptimal detection efficacy) that undermine its utility for threat intelligence. We present the first large-scale mixed-methods evaluation of 8.4 million YARA rules, integrating GitHub repository mining, static syntactic analysis, and dynamic benchmarking against 4,026 malicious and 2,000 benign samples. Our analysis reveals systemic deficiencies: extreme centralization (ten authors drive 80% of rule adoption), a static supply chain (median repository inactivity of 782 days and a median technical lag of 4.2 years), and inadequate coverage of modern payloads such as loaders and stealers. To address these shortcomings, we propose a paradigm shift from opportunistic rule collection to data-driven YARA rule engineering.
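The staleness metrics above (median repository inactivity, technical lag) reduce to simple date arithmetic over each repository's most recent commit. A minimal sketch, using hypothetical last-commit dates rather than the study's actual dataset:

```python
from datetime import date
from statistics import median

def inactivity_days(last_commits, today):
    """Days elapsed since each repository's most recent commit."""
    return [(today - d).days for d in last_commits]

# Hypothetical last-commit dates for five YARA rule repositories.
last_commits = [date(2019, 3, 1), date(2023, 6, 15), date(2020, 11, 2),
                date(2018, 7, 20), date(2022, 1, 9)]
today = date(2024, 6, 1)

gaps = inactivity_days(last_commits, today)
print(median(gaps))  # median inactivity in days across the sample
```

The same approach extends to technical lag by comparing each repository's rule timestamps against the first-seen dates of the threats they target.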
📝 Abstract
YARA has established itself as the de facto standard for "Detection as Code," enabling analysts and DevSecOps practitioners to define signatures for malware identification across the software supply chain. Despite its pervasive use, the open-source YARA ecosystem remains characterized by ad hoc sharing and opaque quality. Practitioners currently rely on public repositories without empirical evidence regarding the ecosystem's structural characteristics, maintenance and diffusion dynamics, or operational reliability. We conducted a large-scale mixed-methods study of 8.4 million rules mined from 1,853 GitHub repositories. Our pipeline integrates repository mining to map supply chain dynamics, static analysis to assess syntactic quality, and dynamic benchmarking against 4,026 malware and 2,000 goodware samples to measure operational effectiveness. We reveal a highly centralized structure where 10 authors drive 80% of rule adoption. The ecosystem functions as a "static supply chain": repositories show a median inactivity of 782 days and a median technical lag of 4.2 years. While static quality scores appear high (mean = 99.4/100), operational benchmarking uncovers significant noise (false positives) and low recall. Furthermore, coverage is heavily biased toward legacy threats (ransomware), leaving modern initial access vectors (loaders, stealers) severely underrepresented. These findings expose a systemic "double penalty": defenders incur high performance overhead for decayed intelligence. We argue that public repositories function as raw data dumps rather than curated feeds, necessitating a paradigm shift from ad hoc collection to rigorous rule engineering. We release our dataset and pipeline to support future data-driven curation tools.
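The gap between near-perfect static quality scores and poor operational results is easier to see once you note what a syntactic check can actually measure. A sketch of one such heuristic, scoring a rule on the presence of its structural sections; this is an illustrative scoring scheme, not the paper's actual metric:

```python
import re

# Illustrative heuristic: award points for a well-formed rule header and
# for the presence of meta, strings, and condition sections. A rule can
# score 100 here and still detect nothing in practice, which is exactly
# the static-vs-operational gap the study reports.
def quality_score(rule_text: str) -> int:
    score = 0
    if re.search(r'^\s*rule\s+\w+', rule_text, re.M):
        score += 40                      # rule header present
    if re.search(r'^\s*meta\s*:', rule_text, re.M):
        score += 20                      # documented metadata
    if re.search(r'^\s*strings\s*:', rule_text, re.M):
        score += 20                      # explicit string patterns
    if re.search(r'^\s*condition\s*:', rule_text, re.M):
        score += 20                      # mandatory condition section
    return score

example = '''
rule SuspiciousLoader
{
    meta:
        author = "analyst"
    strings:
        $a = "LoadLibraryA"
    condition:
        $a
}
'''
print(quality_score(example))
```

Because such checks only inspect form, not detection behavior, dynamic benchmarking against live malware and goodware corpora remains the only way to surface the false positives and low recall the study observes.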