TALP-Pages: An easy-to-integrate continuous performance monitoring framework

📅 2025-10-14
🤖 AI Summary
To address late detection of performance degradation and delayed scalability analysis in HPC application development, this paper proposes TALP-Pages, a lightweight continuous performance monitoring framework that integrates into CI. It combines TALP's on-the-fly collection of performance metrics with in-repository regression analysis, which avoids tracing and gives developers feedback soon after a CI run finishes. Performance data are stored in a CI-friendly folder structure, from which HTML reports are generated automatically; these visualize trends in key performance factors and tabulate scaling efficiency. Integrated into the GENE-X CI setup with only minimal changes, the framework is reported to detect small performance improvements (<2%) and to reduce post-processing overhead by over 90% compared with conventional tracing tools, which makes scalability assessment more practical under tight resource constraints.

📝 Abstract
Ensuring good performance is a key aspect in the development of codes that target HPC machines. As these codes are under active development, the necessity to detect performance degradation early in the development process becomes apparent. In addition, having meaningful insight into application scaling behavior tightly coupled to the development workflow is helpful. In this paper, we introduce TALP-Pages, an easy-to-integrate framework that enables developers to get fast and in-repository feedback about their code performance using established fundamental performance and scaling factors. The framework relies on TALP, which enables the on-the-fly collection of these metrics. Based on a folder structure suited for CI which contains the files generated by TALP, TALP-Pages generates an HTML report with visualizations of the performance factor regression as well as scaling-efficiency tables. We compare TALP-Pages to tracing-based tools in terms of overhead and post-processing requirements and find that TALP-Pages can produce the scaling-efficiency tables faster and under tighter resource constraints. To showcase the ease of use and effectiveness of this approach, we extend the current CI setup of GENE-X with only minimal changes required and showcase the ability to detect and explain a performance improvement.
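The scaling-efficiency tables mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classic POP-style definitions (parallel efficiency as load balance times communication efficiency, and global efficiency as parallel efficiency times computation scalability) and a strong-scaling experiment. The `Run` record and its field names are hypothetical and do not reflect TALP's actual output format or the TALP-Pages implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run summary; not TALP's real file schema."""
    ranks: int                # number of MPI ranks
    elapsed_s: float          # wall-clock time of the measured region
    useful_s: list[float]     # useful computation time per rank


def load_balance(run: Run) -> float:
    # Average useful time divided by the maximum useful time.
    return (sum(run.useful_s) / run.ranks) / max(run.useful_s)


def communication_efficiency(run: Run) -> float:
    # Maximum useful time divided by the elapsed time.
    return max(run.useful_s) / run.elapsed_s


def parallel_efficiency(run: Run) -> float:
    # Equals load_balance * communication_efficiency.
    return (sum(run.useful_s) / run.ranks) / run.elapsed_s


def scaling_table(runs: list[Run]) -> list[dict]:
    """Build one table row per run, using the smallest run as reference."""
    ordered = sorted(runs, key=lambda r: r.ranks)
    ref_total_useful = sum(ordered[0].useful_s)
    rows = []
    for run in ordered:
        pe = parallel_efficiency(run)
        # Strong scaling assumed: total useful work should stay constant.
        cs = ref_total_useful / sum(run.useful_s)
        rows.append({
            "ranks": run.ranks,
            "load_balance": load_balance(run),
            "communication_efficiency": communication_efficiency(run),
            "parallel_efficiency": pe,
            "computation_scalability": cs,
            "global_efficiency": pe * cs,
        })
    return rows
```

Calling `scaling_table` on runs of the same problem at increasing rank counts yields rows that could be rendered as a table. For weak scaling, the computation-scalability term would need to account for the growing problem size.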

Problem

Research questions and friction points this paper is trying to address.

Performance degradation in actively developed HPC codes is often detected late
Insight into application scaling behavior is rarely coupled to the development workflow
Developers lack fast, in-repository feedback on code performance based on fundamental metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

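TALP-Pages, a continuous performance monitoring framework for HPC codes that integrates into existing CI setups
On-the-fly collection of performance and scaling metrics via TALP, without tracing
HTML reports with performance-factor regression visualizations and scaling-efficiency tables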
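
The idea of tracking a performance factor across commits can be sketched as follows. This is an illustrative assumption, not the method TALP-Pages uses: the function name, the `(commit, value)` history format, and the 5% threshold are all hypothetical. It flags any commit whose metric changes by at least a given relative amount compared with the previous commit.

```python
from __future__ import annotations


def flag_changes(
    history: list[tuple[str, float]],
    threshold: float = 0.05,
) -> list[tuple[str, float]]:
    """Return (commit, relative_change) for notable jumps in a metric.

    `history` is ordered oldest to newest; each entry pairs a commit id
    with a positive metric value such as elapsed time in seconds.
    """
    flagged = []
    for (_, previous), (commit, current) in zip(history, history[1:]):
        relative_change = (current - previous) / previous
        if abs(relative_change) >= threshold:
            flagged.append((commit, relative_change))
    return flagged
```

For a metric such as elapsed time, a negative relative change indicates an improvement and a positive one a degradation. A practical setup would likely need to account for run-to-run noise, for example by comparing against a window of earlier commits rather than only the previous one.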