SQuaD: The Software Quality Dataset

📅 2025-11-14
🤖 AI Summary
Existing software quality datasets cover only a few quality dimensions and offer discontinuous temporal coverage, which hinders multi-dimensional analysis of how quality evolves. To address this, we present SQuaD, a cross-ecosystem, multi-dimensional, and temporally continuous open-source software quality dataset. It covers 450 mature projects and 63,586 releases, integrating outputs from nine static analysis tools (e.g., SonarQube, RefactoringMiner++) to extract over 700 metrics at method, class, file, and project levels. SQuaD also includes version control system (VCS) data, issue reports, and CVE/CWE vulnerability records. It captures both *process* (development activities) and *product* (code attributes) aspects of quality, supporting maintainability assessment, technical debt tracking, and just-in-time defect prediction. The dataset is publicly available (DOI: 10.5281/zenodo.17566690) as a resource for large-scale empirical software engineering research.

📝 Abstract
Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, namely SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics shown to enhance Just-In-Time (JIT) defect prediction. SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.17566690).
Problem

Research questions and friction points this paper is trying to address.

Existing datasets lack comprehensive multi-dimensional software quality metrics
SQuaD integrates 700+ metrics from 450 projects across multiple ecosystems
The dataset enables large-scale research on maintainability and technical debt
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates nine static analysis tools to extract quality metrics
Unifies over 700 unique metrics at method, class, file, and project levels
Combines these metrics with version control, issue-tracking, and vulnerability (CVE/CWE) data