🤖 AI Summary
This work addresses the challenge of rapidly distinguishing transient (flaky) from persistent failures within sub-millisecond, CPU-only budgets during continuous integration in distributed databases, so that failed jobs can be promptly retried or escalated. To this end, the authors propose SCOUT, an online classification framework that relies solely on strict-causal features—such as pre-failure telemetry and strictly historical execution data—and integrates lightweight state-aware scoring, optional sparse metadata fusion, post-hoc probability calibration, and a posterior soft-correction mechanism. This design mitigates temporal and cross-domain distribution shifts as well as label bias induced by finite rerun budgets. Evaluated on a benchmark of 3,680 failed runs, SCOUT's feasibility was further studied on TiDB v7/v8 and a large metadata-only GitHub Actions trace, and a production deployment achieves an end-to-end P95 latency of 1.17 milliseconds on CPU.
📝 Abstract
Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware causal online uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior soft-correction to reduce label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs (462 flaky positives) described by 62 telemetry/context features, and further studied its feasibility on TiDB v7/v8 and a large metadata-only GitHub Actions trace. The experimental results demonstrate SCOUT's effectiveness and practical utility. We deployed SCOUT in a production environment, achieving an end-to-end P95 latency of 1.17 ms on CPU.
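To make the two less familiar ingredients concrete, the sketch below illustrates (a) post-hoc probability calibration via Platt scaling, which rescales raw classifier scores so that a single fixed threshold remains meaningful after a shift, and (b) a Bayes-style posterior soft-correction for labels produced by a finite rerun budget: a truly flaky failure is labeled flaky only if at least one of its k reruns passes, so observed "persistent" labels are softly shifted toward the flaky class. This is a minimal stdlib-only illustration of the general techniques named in the abstract; the function names, the rerun model, and all numeric parameters (`prior`, `pass_prob`, `rerun_budget`) are assumptions for exposition, not SCOUT's actual implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Platt scaling: fit p = sigmoid(a*s + b) on held-out
    (score, label) pairs by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def soft_correct(y_obs, prior=0.15, rerun_budget=3, pass_prob=0.5):
    """Hypothetical posterior soft-correction for finite-rerun labels.

    A truly flaky failure is detected (labeled 1) only if at least one
    of `rerun_budget` reruns passes, i.e. with probability
    1 - (1 - pass_prob)**rerun_budget.  By Bayes' rule, an observed
    negative still has some probability of being flaky, so its training
    target is lifted above 0 instead of being trusted outright."""
    detect = 1.0 - (1.0 - pass_prob) ** rerun_budget
    if y_obs == 1:
        return 1.0                      # observed flaky labels are trusted
    missed = prior * (1.0 - detect)     # flaky, but no rerun passed
    return missed / (missed + (1.0 - prior))

# Usage: calibrate held-out scores, then apply one fixed threshold.
held_out_scores = [0.05, 0.20, 0.30, 0.70, 0.85, 0.95]
held_out_labels = [0, 0, 0, 1, 1, 1]
a, b = platt_fit(held_out_scores, held_out_labels)
is_flaky = sigmoid(a * 0.90 + b) >= 0.5   # fixed-threshold decision
```

The same fixed threshold (here 0.5) can then be reused across domains as long as calibration is refreshed on recent held-out data, which is the operational point of the calibration stage.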