📝 Abstract
As we reach exascale, production High Performance Computing (HPC) systems are increasing in complexity. These systems now comprise multiple heterogeneous computing components (CPUs and GPUs) utilized through diverse, often vendor-specific programming models. As application developers and programming-model experts develop higher-level, portable programming models for these systems, debugging and performance optimization require understanding how multiple programming models stacked on top of each other interact with one another. This paper discusses THAPI (Tracing Heterogeneous APIs), a portable, programming-model-centric tracing framework: by capturing comprehensive API call details across layers of the HPC software stack, THAPI enables fine-grained understanding and analysis of how applications interact with programming models and heterogeneous hardware. By leveraging a state-of-the-art tracing framework, the Linux Trace Toolkit Next Generation (LTTng), and capturing much more than tracing toolkits focused on function names and timestamps, this approach enables us to diagnose performance bottlenecks across the software stack, optimize application behavior, and debug programming model implementation issues.
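To illustrate the kind of analysis that cross-layer API traces enable, the sketch below pairs begin/end events from a trace into nested call spans, so that a runtime call issued by a higher-level programming model appears under the API call that triggered it. This is a minimal, hypothetical example: the event tuples, function names, and layout are illustrative assumptions, not THAPI's actual trace format.

```python
# Hypothetical trace events: (timestamp_ns, api, function, phase).
# An OpenMP target offload that internally drives a Level Zero call.
events = [
    (100, "OpenMP", "omp_target", "begin"),
    (120, "L0", "zeCommandQueueExecuteCommandLists", "begin"),
    (300, "L0", "zeCommandQueueExecuteCommandLists", "end"),
    (340, "OpenMP", "omp_target", "end"),
]

def reconstruct_spans(events):
    """Pair begin/end events with a stack.

    Returns (api, function, start_ns, duration_ns, nesting_depth) tuples,
    giving a call-chain view across programming-model layers.
    """
    stack, spans = [], []
    for ts, api, fn, phase in events:
        if phase == "begin":
            stack.append((ts, api, fn))
        else:
            start, open_api, open_fn = stack.pop()
            assert (open_api, open_fn) == (api, fn), "mismatched begin/end"
            # Depth after popping == nesting level of this span.
            spans.append((api, fn, start, ts - start, len(stack)))
    return spans

if __name__ == "__main__":
    for api, fn, start, dur, depth in sorted(reconstruct_spans(events),
                                             key=lambda s: s[2]):
        print("  " * depth + f"{api}:{fn} {dur} ns")
```

Run on the sample events, this prints the OpenMP call with the Level Zero call indented beneath it, making it visible that most of the offload's time is spent inside the lower-level runtime.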