SeedAIchemy: LLM-Driven Seed Corpus Generation for Fuzzing

📅 2025-11-15

🤖 AI Summary
To address the low quality and high manual cost of seed corpora in fuzz testing, this paper proposes an automated corpus generation method driven by large language models (LLMs). The approach uses LLMs to synthesize high-coverage search terms, then combines multi-source web crawling with a multi-stage filtering pipeline (enforcing syntactic validity, structural diversity, and semantic relevance) to build the corpus end to end. Evaluated on 12 real-world programs and libraries, the method improves code coverage by 41.7% on average over naive random corpora and increases the vulnerability detection rate by 2.3×. Its effectiveness is statistically indistinguishable from that of manually curated corpora (p > 0.05), while generation time drops by 92%. The core contribution is the first deep integration of LLMs into the fuzzing corpus generation loop, balancing semantic guidance with engineering practicality.
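The filtering stages named in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a hypothetical JSON-parsing fuzz target, and the names `is_valid_json`, `structure_signature`, and `filter_corpus` are made up for this example. It covers only syntactic validity and structural diversity; the semantic-relevance stage is left out because the summary does not say how it is computed.

```python
import hashlib
import json
from typing import Callable, Iterable, List


def is_valid_json(data: bytes) -> bool:
    """Syntactic validity check for a hypothetical JSON-parsing target."""
    try:
        json.loads(data.decode("utf-8"))
        return True
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return False


def structure_signature(data: bytes) -> str:
    """Coarse structural fingerprint: the JSON shape with leaf values erased."""
    def shape(node):
        if isinstance(node, dict):
            return {key: shape(value) for key, value in sorted(node.items())}
        if isinstance(node, list):
            # Collapse a list to the set of distinct element shapes.
            return sorted({json.dumps(shape(v), sort_keys=True) for v in node})
        return type(node).__name__

    return json.dumps(shape(json.loads(data.decode("utf-8"))), sort_keys=True)


def filter_corpus(
    candidates: Iterable[bytes],
    is_valid: Callable[[bytes], bool] = is_valid_json,
    max_size: int = 1 << 20,  # assumed 1 MiB cap, not taken from the paper
) -> List[bytes]:
    """Keep non-empty, valid files that each add a new structural shape."""
    seen = set()
    kept = []
    for data in candidates:
        if not data or len(data) > max_size:
            continue
        if not is_valid(data):
            continue
        sig = hashlib.sha256(structure_signature(data).encode()).hexdigest()
        if sig in seen:
            continue
        seen.add(sig)
        kept.append(data)
    return kept


if __name__ == "__main__":
    crawled = [b'{"a": 1}', b'{"a": 2}', b"not json", b"[1, 2]", b""]
    print(filter_corpus(crawled))
```

Deduplicating on structure rather than on raw bytes keeps one representative per shape, which keeps the corpus small; a real pipeline would swap in a validity check and fingerprint suited to the target's input format.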

📝 Abstract
We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules, each implementing a different approach to collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually curated corpus on a diverse range of target programs and libraries.

Problem

Research questions and friction points this paper is trying to address.

Automating seed corpus generation for fuzzing
Using LLM workflows to maximize corpus quality
Replacing manual curation with automated corpus collection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LLM-driven corpus generation for fuzzing
Five modules collecting files from internet sources
LLM workflows construct search terms for quality
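
As a rough illustration of how an LLM workflow might construct search terms, the sketch below builds a prompt, sends it to a model, and cleans the response into a deduplicated list of queries. Everything here is an assumption made for the example: the prompt wording, the function names, and the `complete` callable, which stands in for whatever LLM client is used. The source does not describe SeedAIchemy's actual prompts or module interfaces.

```python
import re
from typing import Callable, List

# Matches a leading bullet ("- ", "* ") or list number ("1. ", "2) ").
_LIST_PREFIX = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def build_prompt(target: str, file_format: str, n_terms: int) -> str:
    """Hypothetical prompt; the real tool's prompts are not given in the source."""
    return (
        f"List {n_terms} web search queries, one per line, that would find "
        f"publicly available {file_format} files exercising diverse features "
        f"of {target}."
    )


def parse_terms(response: str, n_terms: int) -> List[str]:
    """Strip list markers, drop blanks and case-insensitive duplicates."""
    terms: List[str] = []
    seen = set()
    for line in response.splitlines():
        term = _LIST_PREFIX.sub("", line).strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms[:n_terms]


def generate_search_terms(
    complete: Callable[[str], str],  # assumed LLM client: prompt in, text out
    target: str,
    file_format: str,
    n_terms: int = 5,
) -> List[str]:
    prompt = build_prompt(target, file_format, n_terms)
    return parse_terms(complete(prompt), n_terms)


if __name__ == "__main__":
    # Canned response standing in for a real model call.
    def fake_llm(prompt: str) -> str:
        return (
            "1. png test suite\n"
            "- PNG test suite\n"
            "* interlaced png samples\n"
            "\n"
            "2) corrupt png files"
        )

    print(generate_search_terms(fake_llm, "libpng", "PNG"))
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider and makes the parsing logic testable with canned responses.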