🤖 AI Summary
This work addresses the challenge of rapidly distinguishing transient (flaky) from persistent failures within sub-millisecond, CPU-only budgets during continuous integration in distributed databases, so that failed jobs can be promptly retried or escalated. To this end, the authors propose SCOUT, an online classification framework that relies solely on strict-causal features—such as pre-failure telemetry and strictly historical execution data—and integrates lightweight state-aware scoring, optional sparse metadata fusion, post-hoc probability calibration, and a posterior soft-correction mechanism. This design mitigates temporal and cross-domain distribution shifts as well as label bias induced by finite rerun budgets. Evaluated on a benchmark of 3,680 failed runs, SCOUT's feasibility was further studied on TiDB v7/v8 and a large metadata-only GitHub Actions trace, and a production deployment achieves an end-to-end P95 latency of 1.17 milliseconds on CPU.
📝 Abstract
Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware causal online uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior soft-correction to reduce label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs (462 flaky positives) described by 62 telemetry/context features, and further studied its feasibility on TiDB v7/v8 and a large metadata-only GitHub Actions trace. The experimental results demonstrate SCOUT's effectiveness and practical utility. We deployed SCOUT in a production environment, achieving an end-to-end P95 latency of 1.17 ms on CPU.
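To make the two less familiar ingredients concrete, the sketch below illustrates (a) post-hoc probability calibration via Platt scaling, which rescales raw classifier scores so that a single fixed threshold remains meaningful after a shift, and (b) a Bayes-style posterior soft-correction for labels produced by a finite rerun budget: a truly flaky failure is labeled flaky only if at least one of its k reruns passes, so observed "persistent" labels are softly shifted toward the flaky class. This is a minimal stdlib-only illustration of the general techniques named in the abstract; the function names, the rerun model, and all numeric parameters (`prior`, `pass_prob`, `rerun_budget`) are assumptions for exposition, not SCOUT's actual implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Platt scaling: fit p = sigmoid(a*s + b) on held-out
    (score, label) pairs by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def soft_correct(y_obs, prior=0.15, rerun_budget=3, pass_prob=0.5):
    """Hypothetical posterior soft-correction for finite-rerun labels.

    A truly flaky failure is detected (labeled 1) only if at least one
    of `rerun_budget` reruns passes, i.e. with probability
    1 - (1 - pass_prob)**rerun_budget.  By Bayes' rule, an observed
    negative still has some probability of being flaky, so its training
    target is lifted above 0 instead of being trusted outright."""
    detect = 1.0 - (1.0 - pass_prob) ** rerun_budget
    if y_obs == 1:
        return 1.0                      # observed flaky labels are trusted
    missed = prior * (1.0 - detect)     # flaky, but no rerun passed
    return missed / (missed + (1.0 - prior))

# Usage: calibrate held-out scores, then apply one fixed threshold.
held_out_scores = [0.05, 0.20, 0.30, 0.70, 0.85, 0.95]
held_out_labels = [0, 0, 0, 1, 1, 1]
a, b = platt_fit(held_out_scores, held_out_labels)
is_flaky = sigmoid(a * 0.90 + b) >= 0.5   # fixed-threshold decision
```

The same fixed threshold (here 0.5) can then be reused across domains as long as calibration is refreshed on recent held-out data, which is the operational point of the calibration stage.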