AI Summary
Architectural fragmentation across AI accelerators hinders efficient cross-platform optimization of GPU kernels.
Method: This paper proposes a platform-agnostic, LLM-driven program synthesis framework comprising a generation agent and a performance analysis agent. It enables single-example-driven cross-platform adaptation through iterative optimization guided by compiler feedback, correctness verification, and parsing of multi-source performance data, including outputs from API- and GUI-based profiling tools.
Contribution/Results: The framework introduces novel cross-architecture knowledge transfer and automates high-performance kernel generation for diverse hardware backends (e.g., NVIDIA CUDA, Apple Metal). Experimental evaluation demonstrates substantial improvements in code quality and optimization efficiency, achieving superior performance over hand-tuned implementations across heterogeneous platforms.
Abstract
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms.
We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
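The refinement loop described in contribution (1) can be illustrated with a minimal Python sketch. This is not KForge's implementation: the names `Attempt`, `refine`, `generate`, `evaluate`, and `analyze` are hypothetical, the agents are replaced by toy stand-in functions rather than LLM calls, and the profiler recommendation is an invented string. The sketch only shows the assumed control flow, in which failed builds or incorrect outputs feed a functional pass and correct programs feed an optimization pass.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """Outcome of building and testing one candidate program (hypothetical type)."""
    source: str
    compiled: bool
    correct: bool
    runtime_ms: Optional[float]
    feedback: str  # compiler or correctness message, empty on success


def refine(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Attempt],
    analyze: Callable[[Attempt], str],
    max_iters: int = 5,
) -> Optional[Attempt]:
    """Return the fastest correct attempt found, or None if none was correct."""
    best: Optional[Attempt] = None
    feedback = ""
    for _ in range(max_iters):
        attempt = evaluate(generate(feedback))
        if not (attempt.compiled and attempt.correct):
            # Functional pass: send the error back to the generation agent.
            feedback = attempt.feedback
            continue
        if best is None or attempt.runtime_ms < best.runtime_ms:
            best = attempt
        # Optimization pass: the analysis agent turns profiling data into advice.
        feedback = analyze(attempt)
    return best


def demo() -> Optional[Attempt]:
    """Toy run: v0 fails to compile, v1 is correct but slow, v2 is faster."""
    table = {
        "v0": Attempt("v0", False, False, None, "compile error"),
        "v1": Attempt("v1", True, True, 10.0, ""),
        "v2": Attempt("v2", True, True, 4.0, ""),
    }

    def generate(feedback: str) -> str:
        if feedback == "":
            return "v0"
        if feedback == "compile error":
            return "v1"
        return "v2"

    def evaluate(source: str) -> Attempt:
        return table[source]

    def analyze(attempt: Attempt) -> str:
        return "hypothetical recommendation: increase threads per group"

    return refine(generate, evaluate, analyze, max_iters=4)
```

In the toy run, the first iteration fails to compile, the second yields a correct baseline, and the third applies the stand-in recommendation to produce a faster version, which `refine` returns. Keeping the two feedback sources behind one `feedback` string is a simplification made here for brevity; a real system would likely keep them structured.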