Evaluating the Effectiveness of Coverage-Guided Fuzzing for Testing Deep Learning Library APIs

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses three key challenges in testing deep learning libraries such as PyTorch and TensorFlow: low coverage, poor input validity, and limited scalability. To this end, we propose FlashFuzz, the first automated testing framework that deeply integrates coverage-guided fuzzing (CGF) with large language models (LLMs). Our method introduces a feedback-driven LLM harness synthesis mechanism, combining API documentation parsing, template matching, and iterative repair to generate high-coverage test cases and semantically valid inputs. Evaluated on 1,151 PyTorch and 662 TensorFlow APIs, FlashFuzz achieves 101.13 to 212.88 percent higher coverage than state-of-the-art approaches and discovers 42 previously unknown bugs, 8 of which have already been confirmed and fixed. These results demonstrate substantial improvements in both the practicality and scalability of CGF for deep learning library testing.

📝 Abstract
Deep Learning (DL) libraries such as PyTorch provide the core components to build major AI-enabled applications. Finding bugs in these libraries is important and challenging. Prior approaches have tackled this by performing either API-level fuzzing or model-level fuzzing, but they do not use coverage guidance, which limits their effectiveness and efficiency. This raises an intriguing question: can coverage-guided fuzzing (CGF), in particular frameworks like LibFuzzer, be effectively applied to DL libraries, and does it offer meaningful improvements in code coverage, bug detection, and scalability compared to prior methods? We present the first in-depth study to answer this question. A key challenge in applying CGF to DL libraries is the need to create a test harness for each API that can transform byte-level fuzzer inputs into valid API inputs. To address this, we propose FlashFuzz, a technique that leverages Large Language Models (LLMs) to automatically synthesize API-level harnesses by combining templates, helper functions, and API documentation. FlashFuzz uses a feedback-driven strategy to iteratively synthesize and repair harnesses. With this approach, FlashFuzz synthesizes harnesses for 1,151 PyTorch and 662 TensorFlow APIs. Compared to state-of-the-art fuzzing methods (ACETest, PathFinder, and TitanFuzz), FlashFuzz achieves 101.13 to 212.88 percent higher coverage and a 1.0x to 5.4x higher validity rate, while also delivering 1x to 1182x speedups in input generation. FlashFuzz has discovered 42 previously unknown bugs in PyTorch and TensorFlow, 8 of which are already fixed. Our study confirms that CGF can be effectively applied to DL libraries and provides a strong baseline for future testing approaches.
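
To make the harness requirement concrete, here is a minimal sketch of what an API-level CGF harness can look like, written with Atheris (a coverage-guided Python fuzzer that speaks the LibFuzzer protocol). The target API (torch.clamp) and the byte-decoding scheme are illustrative assumptions, not the harness FlashFuzz actually synthesizes; the point is the shape of the task: turning a raw byte buffer into structurally valid tensor arguments, tolerating well-formed input rejections, and letting genuine crashes surface.

```python
# Sketch of an API-level fuzz harness; target API and decoding are illustrative.
# Note: covering torch's native code also requires a coverage-instrumented build.
import sys
import atheris
import torch


def TestOneInput(data: bytes):
    # Decode the raw byte buffer into structured, valid API arguments.
    fdp = atheris.FuzzedDataProvider(data)
    rank = fdp.ConsumeIntInRange(1, 4)
    shape = [fdp.ConsumeIntInRange(1, 8) for _ in range(rank)]
    numel = 1
    for dim in shape:
        numel *= dim
    values = fdp.ConsumeFloatListInRange(numel, -1e6, 1e6)
    if len(values) < numel:
        return  # not enough fuzzer bytes to populate the tensor
    tensor = torch.tensor(values, dtype=torch.float32).reshape(shape)
    lo = fdp.ConsumeFloatInRange(-1e6, 1e6)
    hi = fdp.ConsumeFloatInRange(-1e6, 1e6)
    try:
        torch.clamp(tensor, min=lo, max=hi)
    except (ValueError, RuntimeError):
        # Graceful rejections of bad inputs are expected; crashes, hangs,
        # and sanitizer reports are the bugs the fuzzer is hunting for.
        pass


atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```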
Problem

Research questions and friction points this paper is trying to address.

Evaluating coverage-guided fuzzing effectiveness for DL library testing
Automating test harness synthesis for API-level fuzzing via LLMs
Comparing bug detection and coverage improvements against prior methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs to synthesize API harnesses automatically
Uses a feedback-driven strategy to iteratively repair harnesses (see the sketch after this list)
Applies coverage-guided fuzzing to deep learning libraries
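
A rough sketch of how such a synthesize-and-repair feedback loop can be wired up is shown below. Everything here is an assumption for illustration: run_smoke_test, the prompt strings, the repair budget, and the llm.complete interface are hypothetical placeholders, not FlashFuzz's actual components.

```python
# Hedged sketch of a feedback-driven harness synthesis loop.
# All helper names and prompts are hypothetical, not FlashFuzz's real interfaces.
import os
import subprocess
import sys
import tempfile

MAX_REPAIR_ROUNDS = 5  # assumed budget; the real cutoff is a design choice


def run_smoke_test(harness_code: str) -> tuple[bool, str]:
    """Write the candidate harness to disk and give it a short trial run.

    The short run stands in for a brief coverage-gathering fuzz session;
    the captured stderr becomes the repair feedback for the LLM.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path, "-runs=1000"],  # short LibFuzzer-style run
            capture_output=True, text=True, timeout=30,
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "smoke test timed out"
    finally:
        os.unlink(path)


def synthesize_harness(api_doc: str, template: str, llm) -> str | None:
    """Synthesize a harness from a template plus API docs, then repair it
    iteratively using concrete runtime feedback until it runs or the
    budget is exhausted."""
    harness = llm.complete(  # `llm.complete` is a placeholder client API
        f"{template}\n\nWrite a fuzz harness for this API:\n{api_doc}"
    )
    for _ in range(MAX_REPAIR_ROUNDS):
        ok, error_log = run_smoke_test(harness)
        if ok:
            return harness  # harness builds and runs: hand it to the fuzzer
        harness = llm.complete(
            f"This harness failed:\n{harness}\n\nError output:\n{error_log}\n"
            "Return a corrected harness."
        )
    return None  # budget exhausted; skip this API
```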
👥 Authors
Feiran Qin
North Carolina State University, USA
M. M. Abid Naziri
North Carolina State University, USA
Hengyu Ai
ShanghaiTech University, China
Saikat Dutta
Cornell University
Software Engineering · Probabilistic Programming · Program Analysis · Programming Languages
Marcelo d'Amorim
Associate Professor, NC State University
Software Engineering