From Coverage to Causes: Data-Centric Fuzzing for JavaScript Engines

📅 2025-12-19

🤖 AI Summary

Coverage-guided fuzzing for JavaScript engines often overlooks vulnerability-triggering inputs that do not increase coverage, while heuristic strategies depend heavily on domain expertise and generalize poorly. To address these limitations, this paper proposes a data-driven, feature-guided fuzzing approach that learns static code features and lightweight dynamic execution patterns from historical V8 vulnerabilities. Iterative LLM prompting generates 115 static and 49 dynamic candidate features, the latter observable with only five trace flags. Feature selection reduces these to 41 features, which are used to train an XGBoost model that predicts high-risk inputs. Combining static and dynamic features yields over 85% precision with under 1% false alarms, and only 25% of these features are needed for comparable performance, which suggests that most of the search space is irrelevant. The approach successfully reproduces known vulnerabilities in V8 and discovers multiple previously unknown ones.

📝 Abstract

Context: Exhaustive fuzzing of modern JavaScript engines is infeasible due to the vast number of program states and execution paths. Coverage-guided fuzzers waste effort on low-risk inputs and often ignore vulnerability-triggering inputs that do not increase coverage. Existing heuristics proposed to mitigate this require expert effort, are brittle, and are hard to adapt.

Objective: We propose a data-centric, LLM-boosted alternative that learns from historical vulnerabilities to automatically identify minimal static (code) and dynamic (runtime) features for detecting high-risk inputs.

Method: Guided by historical V8 bugs, iterative LLM prompting generated 115 static and 49 dynamic features. The dynamic features require only five trace flags, which minimizes instrumentation cost. After feature selection, 41 features remained and were used to train an XGBoost model that predicts high-risk inputs during fuzzing.

Results: Combining static and dynamic features yields over 85% precision and under 1% false alarms. Only 25% of these features are needed for comparable performance, showing that most of the search space is irrelevant.

Conclusion: This work introduces feature-guided fuzzing, an automated data-driven approach that replaces coverage with data-directed inference, guiding fuzzers toward high-risk states for faster, targeted, and reproducible vulnerability discovery. To support open science, all scripts and data are available on GitHub: https://github.com/KKGanguly/DataCentricFuzzJS
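
The two headline metrics can be made concrete with a small worked example. The sketch below assumes the conventional definitions: precision is TP / (TP + FP), and the false-alarm rate is FP / (FP + TN). The abstract does not spell out its exact definitions, so treat these as assumptions. The confusion counts are hypothetical and chosen only to show the arithmetic.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of inputs flagged high-risk that really are high-risk."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def false_alarm_rate(fp: int, tn: int) -> float:
    """Fraction of benign inputs that were wrongly flagged high-risk."""
    benign = fp + tn
    return fp / benign if benign else 0.0


# Hypothetical confusion counts (not results from the paper).
tp, fp, tn = 90, 10, 1990
print(f"precision        = {precision(tp, fp):.3f}")
print(f"false-alarm rate = {false_alarm_rate(fp, tn):.4f}")
```

Because benign inputs usually far outnumber vulnerability-triggering ones in a fuzzing corpus, the two metrics have different denominators and should be read together rather than in isolation.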

Problem

Research questions and friction points this paper is trying to address.

Automatically identifies minimal static and dynamic features for detecting high-risk inputs

Replaces coverage-guided fuzzing with data-driven inference to target vulnerabilities

Reduces instrumentation cost and irrelevant search space in JavaScript engine fuzzing
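
To illustrate what a static (code) feature might look like, here is a minimal sketch that counts a few syntactic constructs in a JavaScript test case. The feature names and regular expressions are hypothetical examples, not the 115 static features generated in the paper, and a real extractor would more plausibly work on a parsed AST than on raw text.

```python
import re

# Hypothetical static features: names and patterns are illustrative only.
STATIC_PATTERNS = {
    "typed_array_ctor": re.compile(r"\bnew\s+(?:Int|Uint|Float)\d+Array\b"),
    "array_length_write": re.compile(r"\.length\s*=[^=]"),
    "proxy_ctor": re.compile(r"\bnew\s+Proxy\b"),
    "for_loop": re.compile(r"\bfor\s*\("),
}


def extract_static_features(js_source: str) -> dict[str, int]:
    """Count occurrences of each pattern in the given JavaScript source."""
    return {
        name: len(pattern.findall(js_source))
        for name, pattern in STATIC_PATTERNS.items()
    }


sample = """
let a = new Uint8Array(16);
for (let i = 0; i < 4; i++) { a[i] = i; }
let b = [1, 2, 3];
b.length = 0;
"""
print(extract_static_features(sample))
```

Each test case becomes a fixed-length vector of counts, which is the form a tabular model such as XGBoost expects.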

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric fuzzing with static and dynamic features

LLM-boosted feature generation from historical vulnerabilities

XGBoost model predicts high-risk inputs for targeted fuzzing
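
One way a risk predictor could steer a fuzzer is by ordering candidate inputs so that the highest-scoring ones run first. The sketch below shows that idea only. `toy_risk_score` is a hypothetical weighted sum standing in for the trained XGBoost model, and the feature names, weights, and `prioritize` helper are assumptions made for illustration rather than the paper's implementation.

```python
import heapq
from typing import Callable, Iterable

FeatureVector = dict[str, float]


def toy_risk_score(features: FeatureVector) -> float:
    """Hypothetical stand-in for a trained model's predicted risk.

    The weights are made up; the paper trains an XGBoost model instead.
    """
    weights = {
        "typed_array_ctor": 0.5,
        "array_length_write": 0.3,
        "deopt_events": 0.2,  # hypothetical dynamic (runtime) feature
    }
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def prioritize(
    candidates: Iterable[tuple[str, FeatureVector]],
    score: Callable[[FeatureVector], float],
    budget: int,
) -> list[str]:
    """Return up to `budget` candidate IDs, highest predicted risk first."""
    scored = ((score(feats), cid) for cid, feats in candidates)
    return [cid for _, cid in heapq.nlargest(budget, scored)]


candidates = [
    ("a.js", {"for_loop": 3.0}),
    ("b.js", {"typed_array_ctor": 1.0, "array_length_write": 1.0}),
    ("c.js", {"deopt_events": 2.0}),
]
print(prioritize(candidates, toy_risk_score, budget=2))
```

Passing the scorer in as a callable keeps the queue logic independent of the model, so a trained classifier's probability output could replace the toy function without changing `prioritize`.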
