A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration

📅 2026-03-24

🤖 AI Summary
This work addresses the challenge of rapidly distinguishing transient (flaky) from persistent failures under millisecond-scale, CPU-only budgets during continuous integration (CI) for distributed databases, so that failed jobs can be promptly retried or escalated. To this end, the authors propose SCOUT, an online classification framework that relies only on strict-causal features—pre-failure telemetry and strictly historical execution data—and integrates lightweight state-aware scoring, optional sparse metadata fusion, post-hoc probability calibration, and a posterior soft-correction mechanism. This design mitigates temporal and cross-domain distribution shifts as well as label bias induced by finite rerun policies. Evaluated on a benchmark of 3,680 failed runs, SCOUT achieves an end-to-end P95 latency of 1.17 ms on CPU in production deployment; its feasibility was further studied on TiDB v7/v8 and a large metadata-only GitHub Actions trace.

📝 Abstract
Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware, causal, online, uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior soft correction to reduce label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs, including 462 flaky positives, described by 62 telemetry/context features. Further, we studied the feasibility of SCOUT on TiDB v7/v8 and a large metadata-only GitHub Actions trace. The experimental results demonstrated its effectiveness and usefulness. We deployed SCOUT in the production environment, achieving an end-to-end P95 latency of 1.17 ms on CPU.
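The abstract names two ingredients that can be illustrated concretely: post-hoc probability calibration of raw scores, and a posterior soft correction for labels produced by a finite rerun budget (a flaky failure that happens to fail all k reruns gets mislabeled persistent). The sketch below is NOT SCOUT's published method; it is a minimal stand-in using hand-rolled Platt scaling for calibration and a Bayes update for the soft label, with `k` (rerun budget) and `p_pass` (per-rerun chance a flaky test passes) as assumed illustrative parameters.

```python
import math

def platt_calibrate(scores, labels, lr=0.1, epochs=500):
    """Fit a, b so that sigmoid(a*s + b) approximates P(flaky | score s).
    A minimal stand-in for post-hoc calibration (Platt scaling),
    fit by gradient descent on the log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def soft_correct(y_obs, p_model, k, p_pass):
    """Posterior soft label under a finite rerun budget.
    A run labeled persistent (y_obs=0) after k failed reruns may still be
    flaky: a flaky test fails all k reruns with prob (1 - p_pass)**k.
    Returns P(flaky | k failed reruns) via Bayes, using p_model as prior."""
    if y_obs == 1:
        return 1.0  # an observed rerun success confirms flakiness
    miss = (1.0 - p_pass) ** k
    return p_model * miss / (p_model * miss + (1.0 - p_model))
```

Training on such soft labels, instead of the hard 0/1 rerun outcomes, is one way to dampen the bias that a small rerun budget injects into the positive class.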
Problem

Research questions and friction points this paper is trying to address.

flaky failure triage
distributed database
continuous integration
failure classification
online decision making
Innovation

Methods, ideas, or system contributions that make the work stand out.

flaky failure triage
causal online decision
uncertainty calibration
label bias correction
distributed database CI
Jun-Peng Zhu
Northwest A&F University, PingCAP
Qizhi Wang
PingCAP
Yulong Zhai
PingCAP
Yishen Sun
PingCAP
Sen Chen
Professor, Nankai University
Software Security · Vulnerability · Malware · Software Supply Chain Security
Kai Xu
PingCAP
Peng Cai
East China Normal University
Hongming Zhang
Northwest A&F University
Heng Long
PingCAP
Liu Tang
PingCAP
Qi Liu
PingCAP