EpiCoder: Encompassing Diversity and Complexity in Code Generation

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation methods are largely confined to short code snippets and so fall short of the complexity and semantic diversity that real tasks demand. To address this, we propose a generative framework based on Semantic Feature Trees (SFTs), which go beyond the syntactic scope of Abstract Syntax Trees (ASTs) by modeling semantic features hierarchically. Our approach enables systematic synthesis, from functions and multi-file modules to full repositories, via iterative feature refinement and controllable depth- or breadth-biased subtree sampling. We introduce the first SFT-based synthesis paradigm supporting joint control over complexity and diversity; construct the first repository-level synthetic dataset; and establish a complexity metric grounded in software engineering principles alongside an LLM-as-a-judge evaluation framework. Experiments demonstrate state-of-the-art performance on multiple function- and file-level benchmarks, and significant improvements in structural complexity, semantic diversity, and functional completeness at the repository level.

📝 Abstract
Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (ASTs). Unlike an AST, which captures the syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and an LLM-as-a-judge method.
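
The idea of sampling subtrees with controlled depth and breadth can be illustrated with a minimal sketch. This is not the EpiCoder implementation: the `FeatureNode` class, the `sample_subtree` function, its `max_depth` and `max_breadth` parameters, and the toy feature names are all assumptions made for illustration. The abstract only states that depth and breadth are controlled, not how.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNode:
    """One semantic feature in a toy feature tree (hypothetical structure)."""
    name: str
    children: List["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Return a copy of `node` that descends at most `max_depth` levels
    and keeps at most `max_breadth` randomly chosen children per node."""
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def depth(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def widest(node):
    """Largest number of children held by any single node."""
    return max([len(node.children)] + [widest(c) for c in node.children])


# Toy tree; the feature names are invented for this example.
root = FeatureNode("code", [
    FeatureNode("data processing", [
        FeatureNode("parsing", [FeatureNode("json")]),
        FeatureNode("validation"),
    ]),
    FeatureNode("networking", [
        FeatureNode("http client"),
        FeatureNode("retry logic"),
    ]),
    FeatureNode("file io"),
])

# Breadth-biased sample: shallow but wide, suggesting a simpler task
# that touches several unrelated features.
shallow = sample_subtree(root, max_depth=1, max_breadth=3, rng=random.Random(0))

# Depth-biased sample: a single narrow chain of increasingly specific features.
deep = sample_subtree(root, max_depth=3, max_breadth=1, rng=random.Random(0))
```

In this sketch, raising `max_depth` yields more specific, nested feature combinations, while raising `max_breadth` yields more features side by side. A synthesis pipeline could then hand the sampled features to an LLM as the specification for a task, though how EpiCoder does this is not described in the abstract.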
Problem

Research questions and friction points this paper is trying to address.

Code Generation
Complexity
Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

EpiCoder
Feature Tree
Code Complexity Control