Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource software frameworks (e.g., HarmonyOS) suffer from poor code generation performance by large language models (LLMs) due to insufficient pretraining exposure to framework-specific APIs and syntax. To address this, we propose an API knowledge graph–driven data construction and fine-tuning methodology. Our approach integrates an API knowledge graph with uncertainty estimation and Monte Carlo Tree Search (MCTS) to automatically generate high-quality single- and multi-API question–code pairs, without requiring executable code. This yields the first HarmonyOS-specific code generation benchmark dataset. Experiments demonstrate that fine-tuning Qwen on our API-oriented data achieves 25.00% pass@1 accuracy, versus 17.59% for a GPT baseline, substantiating both the effectiveness and generalizability of API-centric data for enhancing LLMs' code generation capabilities in low-resource frameworks.

📝 Abstract
In the context of software frameworks with limited resources (such as HarmonyOS), large language models (LLMs) often exhibit poor code generation performance because they lack sufficient exposure to such environments during pre-training. Although LLMs can usually maintain correct logical structures across programming languages, they frequently struggle when dealing with framework-specific APIs or syntax, resulting in errors. This indicates that while pre-training equips LLMs with general algorithmic capabilities, they remain unfamiliar with the distinctive syntax and API usage of underrepresented frameworks. As a result, even advanced commercial models like GPT-4o cannot reliably generate correct code without prior adaptation. To address this issue, we propose APIKG4SYN, a framework designed to exploit API knowledge graphs for the construction of API-oriented question-code pairs, specifically tailored for low-resource frameworks without requiring executable code. APIKG4SYN integrates both single-API and multi-API knowledge, where the latter is derived through uncertainty estimation (UE)-driven Monte Carlo Tree Search (MCTS), enabling the creation of a diverse and informative dataset for fine-tuning LLMs. Using HarmonyOS as a case study, we build the first benchmark for HarmonyOS code generation. Experimental results show that fine-tuning Qwen with APIKG4SYN raises pass@1 accuracy to 25.00%, compared with 17.59% for the baseline GPT model. These results confirm that API-oriented data significantly enhance LLM performance in low-resource software development scenarios.
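The pass@1 figures reported above follow the standard unbiased pass@k estimator widely used for code generation benchmarks; the paper does not spell out its evaluation code, so the sketch below (including the sample counts) is illustrative, not the authors' implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = generated samples and c = samples passing all tests."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a k-draw without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the passing fraction c / n.
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
assert pass_at_k(5, 5, 1) == 1.0
```

For k=1 the estimator is simply the fraction of sampled completions that pass the benchmark's tests, averaged over problems.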
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with framework-specific APIs and syntax in low-resource environments such as HarmonyOS
Can an API knowledge graph be used to construct tailored, framework-aware training data without executable code?
Does fine-tuning on API-oriented data improve code generation accuracy for underrepresented frameworks?
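The core of the data-construction idea is turning API knowledge-graph entries into instruction-tuning pairs without ever executing code. A minimal sketch of the single-API case, where the entry schema, field names, and the template are assumptions for illustration (the paper uses LLM-based generation, not fixed templates):

```python
# Hypothetical KG entry schema; field names are illustrative assumptions.
api_kg = [
    {"api": "window.getLastWindow",
     "desc": "Gets the last window shown on the current ability.",
     "signature": "getLastWindow(ctx: Context): Promise<Window>"},
]

def to_pair(entry: dict) -> dict:
    """Build one single-API instruction-tuning pair from a KG entry."""
    question = (f"How do I use the HarmonyOS API `{entry['api']}`? "
                f"({entry['desc']})")
    answer = f"// {entry['desc']}\n{entry['signature']}"
    return {"instruction": question, "output": answer}

pairs = [to_pair(e) for e in api_kg]
print(pairs[0]["instruction"])
```

Multi-API pairs then chain several related KG entries into one question, which is where the MCTS-based selection described under Innovation comes in.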
Innovation

Methods, ideas, or system contributions that make the work stand out.

API knowledge graph–driven construction of question–code pairs, with no executable code required
Uncertainty estimation–guided Monte Carlo Tree Search to compose multi-API knowledge
Fine-tuning LLMs on the resulting data for low-resource framework code generation
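The UE-driven MCTS can be pictured as a search over API combinations, where the knowledge graph supplies candidate expansions and an uncertainty score stands in for the simulation reward. The sketch below is a generic UCT loop under assumed structures; the toy API graph, the `uncertainty` stand-in, and all names are hypothetical, not the paper's implementation:

```python
import math
import random

# Toy API knowledge graph: API -> related APIs (illustrative assumption).
API_GRAPH = {
    "Ability.start": ["Window.create", "Context.get"],
    "Window.create": ["Window.loadContent"],
    "Context.get": ["Preferences.get"],
    "Window.loadContent": [],
    "Preferences.get": [],
}

def uncertainty(seq):
    # Stand-in for an LLM uncertainty estimate over an API combination;
    # a deterministic pseudo-score keeps the sketch runnable.
    return (sum(map(len, seq)) * 37 % 100) / 100.0

class Node:
    def __init__(self, seq, parent=None):
        self.seq, self.parent = seq, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def expand(self):
        # Add one child per unused graph neighbor of the last API.
        for nxt in API_GRAPH.get(self.seq[-1], []):
            if nxt not in self.seq:
                self.children.append(Node(self.seq + [nxt], self))

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(root_api, iters=50):
    root = Node([root_api])
    for _ in range(iters):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=Node.uct)
        node.expand()                             # expansion
        leaf = random.choice(node.children) if node.children else node
        reward = uncertainty(leaf.seq)            # "simulation": UE score
        while leaf:                               # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.seq

random.seed(0)
print(mcts("Ability.start"))
```

Each returned sequence is an API combination that the pipeline would then turn into a multi-API question–code pair, favoring combinations the model is most uncertain about.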