🤖 AI Summary
To address API hallucination in code generation, where LLMs invoke non-existent APIs or misuse existing ones, this paper proposes MARIN, a framework that uses hierarchical dependency mining to model project-level structured context and dependency-constrained decoding to dynamically guide generation. Its core contributions include: (1) the first hierarchical dependency-aware mechanism for capturing cross-file and cross-module API dependencies; (2) APIHulBench, the first dedicated benchmark for evaluating API hallucination; and (3) two fine-grained metrics, Micro Hallucination Number (MiHN) and Macro Hallucination Rate (MaHR), to quantify hallucination at token- and function-call levels, respectively. Evaluated on six state-of-the-art LLMs, MARIN reduces API hallucination by 67.52% (MiHN) and 73.56% (MaHR) on average compared to a RAG approach. Applied to Huawei's internal projects with two proprietary LLMs, it achieves average reductions of 57.33% (MiHN) and 59.41% (MaHR).
📝 Abstract
Application Programming Interfaces (APIs) are crucial in modern software development. Large Language Models (LLMs) assist in automated code generation but often struggle with API hallucination, which includes invoking non-existent APIs and misusing existing ones in practical development scenarios. Existing studies resort to Retrieval-Augmented Generation (RAG) methods to mitigate hallucination, but these tend to fail because they generally ignore the structural dependencies in practical projects and do not actually validate whether the generated APIs are available. To address these limitations, we propose MARIN, a framework for mitigating API hallucination in LLM-generated code through hierarchical dependency awareness. MARIN consists of two phases: Hierarchical Dependency Mining, which analyzes the local and global dependencies of the current function to supply comprehensive project context in the LLM's input, and Dependency Constrained Decoding, which uses the mined dependencies to adaptively constrain the generation process so that the generated APIs align with the project's specifications. To facilitate evaluation of the degree of API hallucination, we introduce a new benchmark, APIHulBench, and two new metrics: Micro Hallucination Number (MiHN) and Macro Hallucination Rate (MaHR). Experiments on six state-of-the-art LLMs demonstrate that MARIN effectively reduces API hallucinations, achieving an average decrease of 67.52% in MiHN and 73.56% in MaHR compared to the RAG approach. Applied to Huawei's internal projects and two proprietary LLMs, MARIN achieves average decreases of 57.33% in MiHN and 59.41% in MaHR.
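The two phases can be illustrated with a toy sketch. This is not MARIN's implementation: it assumes a single Python file as the whole "project", mines only function definitions and `from ... import` names, and decodes character by character with a stand-in scoring function instead of LLM token logits. All names here (`mine_dependencies`, `build_trie`, `constrained_decode`, `toy_score`, and the sample project source) are hypothetical.

```python
import ast

END = "$"  # end-of-name marker; "$" cannot appear in a Python identifier

# Hypothetical project file used as the only source of dependencies.
PROJECT_SOURCE = """
from utils.io import load_config, load_cache

def save_config(cfg):
    ...
"""

def mine_dependencies(source: str) -> set[str]:
    """Phase 1 (simplified): collect API names that are available in this file."""
    apis = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            apis.add(node.name)
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                apis.add(alias.asname or alias.name)
    return apis

def build_trie(apis: set[str]) -> dict:
    """Index the mined names as a character trie so prefixes can be checked."""
    root: dict = {}
    for name in sorted(apis):
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def constrained_decode(score, trie: dict) -> str:
    """Phase 2 (simplified): greedy decoding limited to continuations in the trie."""
    node, out = trie, ""
    while True:
        # Only characters that keep the output a prefix of a real API are allowed.
        best = max(node, key=lambda ch: score(out, ch))
        if best == END:
            return out
        out += best
        node = node[best]

# Stand-in for a model that "wants" to emit an API that does not exist.
HALLUCINATED = "load_configuration"

def toy_score(prefix: str, ch: str) -> float:
    want = HALLUCINATED[len(prefix)] if len(prefix) < len(HALLUCINATED) else END
    return 1.0 if ch == want else 0.0

allowed_apis = mine_dependencies(PROJECT_SOURCE)
result = constrained_decode(toy_score, build_trie(allowed_apis))
print(result)  # load_config
```

Unconstrained, the toy scorer would spell out `load_configuration`, which the project does not define. Restricting each step to continuations found in the mined dependencies steers it to `load_config`, which exists. A real system would apply the constraint to model tokens and would draw on dependencies from across the project rather than a single file.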