🤖 AI Summary
Current code large language models (LLMs) suffer from limited exposure to realistic programming problems, hindering their practical deployment. To address this, we propose a “scenario-centric graph” modeling framework that structurally represents application-domain knowledge, domain-specific skills, and programming skills as a heterogeneous graph. Leveraging real-world data from Stack Overflow and Kaggle, we systematically construct and sample from this graph to generate high-fidelity, diverse programming problems—integrating multi-dimensional skill elements for the first time. Our method enables end-to-end problem synthesis via LLMs. Evaluated on realistic benchmarks—including CodeContests and an APPS subset—it significantly outperforms open-source code-specialized models (e.g., CodeLlama, StarCoder) and general-purpose LLMs (e.g., LLaMA-3) of comparable scale. Results demonstrate that scenario-aware graph modeling critically enhances model generalization to real-world programming tasks.
📝 Abstract
Significant advancements have been made in the capabilities of code large language models, leading to their rapid adoption and application across a wide range of domains. However, their further advancements are often constrained by the scarcity of real-world coding problems. To bridge this gap, we propose a novel framework for synthesizing code problems that emulate authentic real-world scenarios. This framework systematically integrates domain knowledge, domain skills, and coding skills, all of which are meticulously extracted from real-world programming-related datasets, including Stack Overflow and Kaggle. The extracted elements serve as the foundational building blocks for constructing code problems. To align the generated problems with practical applications, application scenarios are also mined from the aforementioned datasets. These scenarios are then utilized to construct a scenario-centric graph that interconnects domain knowledge, domain skills, and coding skills. Based on this structured representation, a sampling strategy on the graph is designed, which effectively controls the generation of a code problem with complexity and diversity, reflects real-world challenges. Experimental results demonstrate that the proposed method consistently achieves superior performance over state-of-the-art open-source large language models of varying sizes and functionalities, including both coders and general-purpose models, across a diverse set of real-world benchmarks.