Policy Testing with MDPFuzz (Replicability Study)

📅 2024-09-11
🏛️ International Symposium on Software Testing and Analysis
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses functional testing of reinforcement learning (RL) policies in black-box Markov decision processes (MDPs), focusing on whether coverage-guided fuzzing actually helps detect policy failures. Method: the authors systematically reproduce and extend the MDPFuzz framework, empirically probing the limits of its coverage-guidance mechanism across seven representative MDP benchmarks. They compare MDPFuzz against an ablated variant (without coverage guidance) and a random testing baseline, and perform a parameter sensitivity analysis. Contribution/Results: contrary to MDPFuzz's core assumption, the Gaussian Mixture Model (GMM)-based coverage metric fails to improve fault detection and often degrades it; the ablated version discovers 23.6% more faults on average. Coverage guidance also exhibits high parameter sensitivity and low robustness across environments. This study provides the first empirical evidence challenging MDPFuzz's claimed efficacy, revealing fundamental limitations in its coverage model, and offers methodological lessons for designing and evaluating coverage-guided testing of black-box RL policies.

📝 Abstract
In recent years, following tremendous achievements in Reinforcement Learning, a great deal of interest has been devoted to ML models for sequential decision-making. Alongside these scientific advances, research has been conducted to develop automated functional testing methods for finding faults in black-box Markov decision processes. Pang et al. (ISSTA 2022) presented a black-box fuzz testing framework called MDPFuzz. The method consists of a fuzzer whose main feature is to use Gaussian Mixture Models (GMMs) to compute the coverage of a test input as the likelihood of having already observed its result. This coverage-based guidance aims at favoring novelty during testing and, thereby, fault discovery in the decision model. Pang et al. evaluated their work on four use cases by comparing the number of failures found after twelve-hour testing campaigns with and without the guidance of the GMMs (ablation study). In this paper, we verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We re-implemented the proposed methodology and evaluated our replication in a large-scale study that extends the original four use cases with three new ones. Furthermore, we compare MDPFuzz and its ablated counterpart with a random testing baseline. We also assess the effectiveness of coverage guidance under different parameters, something that was not done in the original evaluation. Despite this parameter analysis, and unlike Pang et al.'s original conclusions, we find that in most cases the ablated fuzzer outperforms MDPFuzz, and we conclude that the proposed coverage model does not lead to finding more faults.
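The coverage idea described in the abstract, scoring a test input's novelty as the likelihood of having already observed its result, can be sketched as follows. This is a minimal illustration, not MDPFuzz's actual implementation: it replaces the EM-fitted GMM with a fixed-bandwidth mixture of isotropic Gaussians centered at previously seen result states, and the `bandwidth` and `threshold` values are assumptions chosen for the example.

```python
import math

def log_likelihood(state, seen_states, bandwidth=0.5):
    """Log-likelihood of `state` under an equal-weight mixture of
    isotropic Gaussians centered at `seen_states` (a kernel-density
    stand-in for the EM-fitted GMM used by MDPFuzz)."""
    if not seen_states:
        return float("-inf")  # nothing observed yet: maximally novel
    d = len(state)
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2)
    logs = []
    for mu in seen_states:
        sq_dist = sum((s - m) ** 2 for s, m in zip(state, mu))
        logs.append(log_norm - sq_dist / (2 * bandwidth ** 2))
    # Log-sum-exp for numerical stability, then average over components.
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs) / len(logs))

def is_novel(state, seen_states, threshold=-5.0):
    """Low likelihood means the result looks new, so the guided fuzzer
    would keep the corresponding seed for further mutation."""
    return log_likelihood(state, seen_states) < threshold
```

The replication's finding is precisely that steering mutations with this kind of likelihood score did not discover more faults than mutating seeds without it.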
Problem

Research questions and friction points this paper is trying to address.

Automated functional testing for black-box MDPs
Replication and limits of MDPFuzz methodology
Effectiveness of coverage guidance in fault discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Gaussian Mixture Models (GMMs)
Implements black-box fuzz testing
Compares with random testing baseline