🤖 AI Summary
High-quality machine learning (ML) operator code in architecture-specific programming languages (ASPLs) is scarce: development relies heavily on rare domain experts, and large language models (LLMs) generate low-quality code because ASPL training data is scarce. Method: The paper proposes a feedback-driven, multi-stage self-improving LLM agent system that integrates chain-of-thought reasoning, execution-feedback reinforcement, dynamic tool invocation, and error-driven replanning. It supports collaborative orchestration of open- and closed-source LLMs and enables end-to-end generation and iterative refinement of ML operators from minimal ASPL examples. Contribution/Results: Its core contribution is an adaptive self-improving agent architecture that alleviates the generalization bottleneck of low-resource ASPL programming. On a benchmark built from a typical ML library, the system improves code-generation accuracy by up to 3.9× over a single-LLM baseline, demonstrating its efficacy for difficult, few-shot ASPL code synthesis.
📝 Abstract
ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open- and closed-source LLMs on this benchmark. Our results show improvements of up to $3.9\times$ over a baseline single LLM.