FlakyGuard: Automatically Fixing Flaky Tests at Industry Scale

📅 2025-11-17

🤖 AI Summary

Flaky tests, which pass or fail non-deterministically, significantly impede industrial development workflows. Existing approaches (e.g., FlakyDoctor) fail to guide large language models (LLMs) effectively because they supply the wrong amount of context: either too little information or too much irrelevant code. This paper proposes FlakyGuard, a graph-aware context filtering framework that models source code as a graph and uses selective graph exploration to extract only the context most relevant to the flakiness. Evaluated on real-world flaky tests from industrial repositories, FlakyGuard repairs 47.6% of reproducible flaky tests, with 51.8% of the fixes accepted by developers, and outperforms state-of-the-art approaches by at least 22% in repair success rate. In developer surveys, 100% of respondents found its root-cause explanations useful.

📝 Abstract

Flaky tests that non-deterministically pass or fail waste developer time and slow release cycles. While large language models (LLMs) show promise for automatically repairing flaky tests, existing approaches like FlakyDoctor fail in industrial settings due to the context problem: providing either too little context (missing critical production code) or too much context (overwhelming the LLM with irrelevant information). We present FlakyGuard, which addresses this problem by treating code as a graph structure and using selective graph exploration to find only the most relevant context. Evaluation on real-world flaky tests from industrial repositories shows that FlakyGuard repairs 47.6% of reproducible flaky tests, with 51.8% of the fixes accepted by developers. It also outperforms state-of-the-art approaches by at least 22% in repair success rate. Developer surveys confirm that 100% of respondents find FlakyGuard's root cause explanations useful.
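
The abstract does not spell out the exploration algorithm, so the following is only a minimal sketch of what selective graph exploration could look like, not FlakyGuard's actual implementation. It assumes the code is available as a call graph (a dict from a function name to its callees) and that some relevance scorer exists; the names `select_context`, `score`, `budget`, and `min_score`, as well as the toy graph and scores, are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Set


def select_context(
    graph: Dict[str, List[str]],
    start: str,
    score: Callable[[str], float],
    budget: int,
    min_score: float = 0.0,
) -> List[str]:
    """Best-first exploration from `start`, keeping at most `budget` nodes.

    Assumption: `score` estimates how relevant a node is to the flaky
    behavior. Nodes scoring at or below `min_score` are pruned and never
    expanded, which addresses the "too much context" side of the problem.
    """
    selected: List[str] = []
    seen: Set[str] = {start}
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-float("inf"), start)]
    while frontier and len(selected) < budget:
        _, node = heapq.heappop(frontier)
        selected.append(node)
        for neighbor in graph.get(node, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            relevance = score(neighbor)
            if relevance > min_score:
                heapq.heappush(frontier, (-relevance, neighbor))
    return selected


# Hypothetical call graph rooted at a flaky test.
call_graph = {
    "test_cache_expiry": ["Cache.get", "Logger.info", "make_fixture"],
    "Cache.get": ["Clock.now", "Metrics.incr"],
    "make_fixture": [],
    "Clock.now": [],
}
# Hypothetical relevance scores; a real system would compute these.
scores = {"Cache.get": 0.9, "Clock.now": 0.8, "make_fixture": 0.3}

context = select_context(
    call_graph, "test_cache_expiry", lambda n: scores.get(n, 0.0), budget=3
)
print(context)
```

With a budget of three nodes, the sketch follows the test into `Cache.get` and then `Clock.now`, reaching production code two calls away (the "too little context" case) while skipping logging and metrics helpers that score zero.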

Problem

Research questions and friction points this paper is trying to address.

Automatically fixing non-deterministic flaky tests in industrial settings

Solving context problems in LLM-based test repair approaches

Improving repair success rates for flaky tests using graph exploration
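
To make the target of these questions concrete, here is a small illustrative example of a flaky test and one common style of fix. It is not taken from the paper's dataset; `greeting`, the test names, and the injected `now` parameter are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable


def greeting(
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> str:
    """Return a greeting that depends on the current hour (UTC)."""
    return "good morning" if now().hour < 12 else "good afternoon"


def test_greeting_flaky() -> bool:
    # Flaky: the outcome depends on the wall clock, so the same test
    # passes before 12:00 UTC and fails after it.
    return greeting() == "good morning"


def test_greeting_fixed() -> bool:
    # Fix: inject a fixed clock so the result no longer depends on
    # when the test happens to run.
    fixed_clock = lambda: datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
    return greeting(now=fixed_clock) == "good morning"
```

Note that the root cause sits in `greeting`, not in the test body, which is why a repair tool needs production code in its context.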

Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats code as graph structure for analysis

Uses selective graph exploration for context

Automatically fixes flaky tests at scale
