Flaky Tests in a Large Industrial Database Management System: An Empirical Study of Fixed Issue Reports for SAP HANA

📅 2026-02-03

🤖 AI Summary

This study addresses flaky tests, which undermine automated code quality assessment in the industrial-scale database system SAP HANA. To enable systematic root cause analysis, the authors use large language models (LLMs) as automated annotators and rely on intra- and inter-model consistency to assign a root cause category to 559 issue reports about fixed flaky tests. Concurrency is the most common category, accounting for 23% (130 of 559) of the analyzed reports, and the results suggest that different test types face different flakiness challenges. The authors therefore encourage future work on flakiness mitigation to evaluate how well proposed approaches generalize across test types.

📝 Abstract

Flaky tests yield different results when executed multiple times for the same version of the source code. Thus, they provide an ambiguous signal about the quality of the code and interfere with the automated assessment of code changes. While a variety of factors can cause test flakiness, approaches to fix flaky tests are typically tailored to address specific causes. However, the prevalent root causes of flaky tests can vary depending on the programming language, application domain, or size of the software project. Since manually labeling flaky tests is time-consuming and tedious, this work proposes an LLMs-as-annotators approach that leverages intra- and inter-model consistency to label issue reports related to fixed flakiness issues with the relevant root cause category. This allows us to gain an overview of prevalent flakiness categories in the issue reports. We evaluated our labeling approach in the context of SAP HANA, a large industrial database management system. Our results suggest that SAP HANA's tests most commonly suffer from issues related to concurrency (23%, 130 of 559 analyzed issue reports). Moreover, our results suggest that different test types face different flakiness challenges. Therefore, we encourage future research on flakiness mitigation to consider evaluating the generalizability of proposed approaches across different test types.
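
The abstract names intra- and inter-model consistency but does not spell out the procedure, so the following Python sketch shows only one plausible reading of the idea, not the authors' implementation. It assumes that each model is queried several times per issue report (intra-model), that a label is kept only when one model's runs agree often enough, and that all models must then agree with each other (inter-model). The names `intra_model_label`, `consistent_label`, `category_shares`, the `runs` and `min_agreement` parameters, and the example category strings are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Sequence

# An annotator maps the text of an issue report to a root cause category.
# In practice this would wrap an LLM call; here it is any callable (assumption).
Annotator = Callable[[str], str]


def intra_model_label(
    annotate: Annotator, report: str, runs: int = 5, min_agreement: float = 0.8
) -> Optional[str]:
    """Query one annotator `runs` times and keep its majority label
    only if the share of matching runs reaches `min_agreement`."""
    votes = Counter(annotate(report) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label if count / runs >= min_agreement else None


def consistent_label(
    annotators: Sequence[Annotator],
    report: str,
    runs: int = 5,
    min_agreement: float = 0.8,
) -> Optional[str]:
    """Return a label only if every annotator is self-consistent and all
    annotators agree with each other; otherwise return None so the
    report can be routed to manual review."""
    labels = [
        intra_model_label(a, report, runs, min_agreement) for a in annotators
    ]
    if None in labels or len(set(labels)) != 1:
        return None
    return labels[0]


def category_shares(labels: Iterable[Optional[str]]) -> Dict[str, float]:
    """Share of each category among the reports that received a label."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

Under these assumptions, reports for which `consistent_label` returns `None` would be left unlabeled rather than forced into a category, and `category_shares` yields the kind of per-category overview the abstract reports (for example, 130 of 559 reports is roughly 23%).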

Problem

Research questions and friction points this paper is trying to address.

flaky tests

root cause analysis

database management system

test reliability

SAP HANA

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs-as-annotators

flaky tests

root cause classification