🤖 AI Summary
This work proposes PASTA, a modular, low-overhead profiling framework for hardware accelerators, addressing the lack of flexible program analysis tools in modern computing systems. By abstracting underlying profiling APIs and integrating with mainstream deep learning frameworks, it offers a unified interface for capturing runtime events at multiple abstraction levels and enables rapid prototyping of custom tools. It features a GPU-accelerated backend, multi-level event tracing, and support for both NVIDIA and AMD GPUs in single- and multi-GPU settings, with low profiling overhead. On representative deep learning workloads running on NVIDIA GPUs, the framework profiles up to 1.3×10⁴ times faster than conventional analysis tools while delivering fine-grained performance insights.
📝 Abstract
The increasing complexity and diversity of hardware accelerators in modern computing systems demand flexible, low-overhead program analysis tools. We present PASTA, a low-overhead and modular Program AnalysiS Tool Framework for Accelerators. PASTA abstracts over low-level profiling APIs and diverse deep learning frameworks, offering users a unified interface to capture and analyze runtime events at multiple levels. Its extensible design enables researchers and practitioners to rapidly prototype custom tools with minimal overhead. We demonstrate the utility of PASTA by developing several analysis tools, including a deep learning workload characterization tool and a UVM optimization tool. Through extensive evaluation on mainstream deep learning workloads on NVIDIA and AMD GPUs, in both single- and multi-GPU scenarios, we show PASTA's broad applicability. On NVIDIA GPUs, we further show that PASTA provides detailed performance insights with significantly lower overhead, running up to 1.3×10⁴ times faster than conventional analysis tools thanks to its GPU-accelerated backend. PASTA strikes a practical balance between usability, extensibility, and efficiency, making it well-suited for modern accelerator-based computing environments.