KForge: Program Synthesis for Diverse AI Hardware Accelerators

๐Ÿ“… 2025-11-17
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Architectural fragmentation across AI accelerators hinders efficient cross-platform optimization of GPU kernels. Method: This paper proposes a platform-agnostic, LLM-driven program synthesis framework comprising a generation agent and a performance analysis agent. It enables single-example-driven cross-platform adaptation through iterative optimization guided by compiler feedback, correctness verification, and parsing of multi-source performance dataโ€”including outputs from API- and GUI-based profiling tools. Contribution/Results: The framework introduces novel cross-architecture knowledge transfer and automates high-performance kernel generation for diverse hardware backends (e.g., NVIDIA CUDA, Apple Metal). Experimental evaluation demonstrates substantial improvements in code quality and optimization efficiency, achieving superior performance over hand-tuned implementations across heterogeneous platforms.

Technology Category

Application Category

๐Ÿ“ Abstract
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
Problem

Research questions and friction points this paper is trying to address.

Optimizing GPU kernels across diverse AI hardware accelerators
Automating program synthesis through collaborative LLM-based agents
Enabling platform-agnostic code generation with minimal examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Platform-agnostic framework with collaborative LLM-based agents
Iterative refinement system using compilation feedback and profiling
Cross-platform knowledge transfer requiring only single-shot examples
๐Ÿ”Ž Similar Papers
No similar papers found.