🤖 AI Summary
To address the low quality and high manual cost of seed corpora in fuzz testing, this paper proposes a large language model (LLM)-driven method for automated corpus generation. The approach uses LLMs to synthesize high-coverage search terms, then combines multi-source web crawling with a multi-stage filtering pipeline that enforces syntactic validity, structural diversity, and semantic relevance, yielding end-to-end corpus construction. Evaluated on 12 real-world programs and libraries, the method improves code coverage by 41.7% on average over naive random corpora and increases the vulnerability detection rate by 2.3×. Its effectiveness is statistically indistinguishable from that of manually curated corpora (p > 0.05), while generation time is reduced by 92%. The core contribution is the first deep integration of LLMs into the fuzzing corpus generation loop, balancing semantic guidance with engineering practicality.
📝 Abstract
We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules, each implementing a different approach to collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually curated corpus on a diverse range of target programs and libraries.
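As a rough illustration of how collected files might be filtered before they enter a seed corpus, the sketch below applies a size cap, a validity check, and exact-duplicate removal. It is not SeedAIchemy's implementation: the names `filter_corpus` and `is_valid_json`, the 1 MiB size cap, and the use of JSON as the target format are all assumptions made for this example, and hashing for exact duplicates is only a crude stand-in for a real structural-diversity measure.

```python
import hashlib
import json
from typing import Callable, Iterable, List


def is_valid_json(data: bytes) -> bool:
    """Hypothetical validity check: does the file parse as UTF-8 JSON?"""
    try:
        json.loads(data.decode("utf-8"))
        return True
    except ValueError:
        # Covers both UnicodeDecodeError and json.JSONDecodeError,
        # which are subclasses of ValueError.
        return False


def filter_corpus(
    candidates: Iterable[bytes],
    is_valid: Callable[[bytes], bool],
    max_size: int = 1 << 20,  # assumed cap of 1 MiB per seed file
) -> List[bytes]:
    """Keep non-empty, size-bounded, valid, non-duplicate files in input order."""
    seen = set()
    kept = []
    for data in candidates:
        # Stage 1: drop empty or oversized files.
        if not data or len(data) > max_size:
            continue
        # Stage 2: drop files the target format's parser rejects.
        if not is_valid(data):
            continue
        # Stage 3: drop byte-for-byte duplicates of files already kept.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(data)
    return kept


if __name__ == "__main__":
    files = [b'{"a": 1}', b'{"a": 1}', b"not json", b"", b"[1, 2, 3]"]
    for seed in filter_corpus(files, is_valid_json):
        print(seed)
```

Passing the validity check as a callable keeps the filter independent of any one file format, so the same loop could be reused for a different fuzzing target by swapping in another parser.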