SeedAIchemy: LLM-Driven Seed Corpus Generation for Fuzzing

📅 2025-11-15
🤖 AI Summary
To address the low quality and high manual cost of seed corpora in fuzz testing, this paper proposes a large language model (LLM)-driven method for automated corpus generation. The approach uses LLMs to synthesize high-coverage search terms, then combines multi-source web crawling with a multi-stage filtering pipeline whose stages jointly enforce syntactic validity, structural diversity, and semantic relevance, so that corpora are constructed end to end. Evaluated on 12 real-world programs and libraries, the method improves code coverage by 41.7% on average over naive random corpora and increases the vulnerability detection rate by 2.3×. Its effectiveness is statistically indistinguishable from that of manually curated corpora (p > 0.05), while generation time is reduced by 92%. The core contribution is the first deep integration of LLMs into the fuzzing corpus generation loop, balancing semantic guidance with engineering practicality.
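As a rough illustration of the filtering idea in the summary above, the sketch below keeps only candidate files that pass a cheap validity check and drops exact duplicates. It is not the paper's pipeline: the function names, the PNG signature check, and the size limit are assumptions chosen for the example, and it covers only syntactic validity and deduplication, not structural diversity or semantic relevance.

```python
import hashlib
from typing import Callable, Iterable

# Assumption for this example: the target parses PNG files, which begin
# with a fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Cheap syntactic check: does the file start with the PNG signature?"""
    return data.startswith(PNG_SIGNATURE)


def filter_candidates(
    candidates: Iterable[bytes],
    is_valid: Callable[[bytes], bool],
    max_size: int = 1 << 20,  # hypothetical 1 MiB cap to keep seeds small
) -> list[bytes]:
    """Return valid, size-bounded, de-duplicated candidates in input order."""
    seen_digests = set()
    kept = []
    for data in candidates:
        # Skip empty or oversized files.
        if not data or len(data) > max_size:
            continue
        # Skip files that fail the format check.
        if not is_valid(data):
            continue
        # Skip exact duplicates, identified by content hash.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        kept.append(data)
    return kept
```

A real pipeline would likely replace `looks_like_png` with a check suited to the fuzzing target's input format and add a diversity measure beyond exact hashing.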

📝 Abstract
We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules that implement different approaches to collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually curated corpus on a diverse range of target programs and libraries.
Problem

Research questions and friction points this paper is trying to address.

Automating seed corpus generation for fuzzing
Using LLM workflows to maximize corpus quality
Replacing manual curation with automated corpus collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LLM-driven corpus generation for fuzzing
Five modules collecting files from internet sources
LLM workflows construct search terms for quality
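
The search-term step listed above could look roughly like the following sketch. The prompt wording, the function names, and the `llm` callable are all hypothetical; the paper's actual prompts and modules are not described here, so the model is passed in as a plain function and no real LLM API is assumed.

```python
import re
from typing import Callable

# Matches a leading list marker such as "- ", "* ", "1. " or "2) ".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def build_prompt(file_format: str, n_terms: int) -> str:
    """Hypothetical prompt asking the model for search queries."""
    return (
        f"List {n_terms} web search queries likely to find diverse, "
        f"publicly available {file_format} files. One query per line."
    )


def parse_terms(response: str, n_terms: int) -> list[str]:
    """Strip list markers, drop blanks and case-insensitive duplicates."""
    terms = []
    seen = set()
    for line in response.splitlines():
        term = LIST_MARKER.sub("", line).strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms[:n_terms]


def generate_search_terms(
    llm: Callable[[str], str], file_format: str, n_terms: int = 5
) -> list[str]:
    """Ask the supplied model for queries and return the cleaned list."""
    return parse_terms(llm(build_prompt(file_format, n_terms)), n_terms)
```

Passing the model as a callable keeps the sketch testable with a stub and independent of any particular provider.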
Aidan Wen
University of California, Berkeley
Norah A. Alzahrani
Humain, Saudi Arabia
Jingzhi Jiang
University of California, Berkeley
Andrew Joe
University of California, Berkeley
Karen Shieh
University of California, Berkeley
Andy Zhang
University of California, Berkeley
Basel Alomair
King Abdulaziz City for Science and Technology & University of Washington
Information Security and Cryptography
David Wagner
Professor of Computer Science, UC Berkeley
computer security