SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current code large language models (LLMs) suffer from limited exposure to realistic programming problems, hindering their practical deployment. To address this, we propose a “scenario-centric graph” modeling framework that structurally represents application-domain knowledge, domain-specific skills, and programming skills as a heterogeneous graph. Leveraging real-world data from Stack Overflow and Kaggle, we systematically construct and sample from this graph to generate high-fidelity, diverse programming problems—integrating multi-dimensional skill elements for the first time. Our method enables end-to-end problem synthesis via LLMs. Evaluated on realistic benchmarks—including CodeContests and an APPS subset—it significantly outperforms open-source code-specialized models (e.g., CodeLlama, StarCoder) and general-purpose LLMs (e.g., LLaMA-3) of comparable scale. Results demonstrate that scenario-aware graph modeling critically enhances model generalization to real-world programming tasks.

📝 Abstract
Code large language models have advanced significantly, leading to their rapid adoption across a wide range of domains. However, further progress is often constrained by the scarcity of real-world coding problems. To bridge this gap, we propose a novel framework for synthesizing code problems that emulate authentic real-world scenarios. The framework systematically integrates domain knowledge, domain skills, and coding skills, all of which are extracted from real-world programming-related datasets, including Stack Overflow and Kaggle. The extracted elements serve as the building blocks for constructing code problems. To align the generated problems with practical applications, application scenarios are also mined from the same datasets. These scenarios are then used to construct a scenario-centric graph that interconnects domain knowledge, domain skills, and coding skills. Based on this structured representation, we design a sampling strategy on the graph that controls the complexity and diversity of each generated code problem so that it reflects real-world challenges. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art open-source large language models of varying sizes and functionalities, including both coders and general-purpose models, across a diverse set of real-world benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Synthesizing code problems that emulate authentic real-world scenarios
Integrating domain knowledge, domain skills, and coding skills extracted from programming datasets
Generating diverse and complex coding challenges that reflect practical use
Innovation

Methods, ideas, or system contributions that make the work stand out.

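Extracts domain knowledge, domain skills, and coding skills from real-world datasets (Stack Overflow and Kaggle)
Constructs a scenario-centric graph that interconnects these elements
Samples the graph to control the complexity and diversity of generated problems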
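
The three steps above can be illustrated with a toy sketch. This is not the authors' implementation: the `ScenarioGraph` class, the `per_kind` knob, the uniform sampling rule, the prompt wording, and all example scenarios and elements are assumptions made for illustration. It only shows the general shape of the idea, where a scenario node links to domain knowledge, domain skills, and coding skills, and a sampled subgraph is turned into a synthesis prompt.

```python
import random
from collections import defaultdict

# Hypothetical element kinds; the identifiers are illustrative, not from the paper.
KINDS = ("domain_knowledge", "domain_skill", "coding_skill")


class ScenarioGraph:
    """Toy scenario-centric graph: each scenario links to elements of three kinds."""

    def __init__(self):
        # scenario -> kind -> set of element names
        self.edges = defaultdict(lambda: {k: set() for k in KINDS})

    def add(self, scenario, kind, element):
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        self.edges[scenario][kind].add(element)

    def sample(self, rng, per_kind=1):
        """Pick one scenario uniformly, then up to `per_kind` linked elements per kind.

        `per_kind` is an assumed stand-in for a complexity control: a larger
        value yields a problem that combines more elements.
        """
        scenario = rng.choice(sorted(self.edges))
        picked = {}
        for kind in KINDS:
            pool = sorted(self.edges[scenario][kind])
            picked[kind] = rng.sample(pool, min(per_kind, len(pool)))
        return scenario, picked


def build_prompt(scenario, picked):
    """Format a sampled subgraph as a problem-synthesis prompt for an LLM."""
    lines = [f"Write a coding problem set in this scenario: {scenario}."]
    for kind in KINDS:
        if picked[kind]:
            lines.append(f"{kind.replace('_', ' ')}: {', '.join(picked[kind])}")
    return "\n".join(lines)


# Made-up example data; real elements would be mined from Stack Overflow and Kaggle.
graph = ScenarioGraph()
graph.add("retail sales forecasting", "domain_knowledge", "seasonality")
graph.add("retail sales forecasting", "domain_knowledge", "promotions")
graph.add("retail sales forecasting", "domain_skill", "time-series feature engineering")
graph.add("retail sales forecasting", "coding_skill", "pandas groupby")
graph.add("retail sales forecasting", "coding_skill", "rolling windows")
graph.add("log anomaly triage", "domain_knowledge", "HTTP status codes")
graph.add("log anomaly triage", "domain_skill", "outlier detection")
graph.add("log anomaly triage", "coding_skill", "regular expressions")

rng = random.Random(0)  # seeded so repeated runs draw the same sample
scenario, picked = graph.sample(rng, per_kind=2)
prompt = build_prompt(scenario, picked)
print(prompt)
```

Sampling within a single scenario keeps the selected elements topically coherent, while drawing different scenarios and element subsets across calls is one simple way to obtain diversity. The paper's actual sampling strategy may differ.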
Authors

Xifeng Yao
Huawei Technologies Co., Ltd.
Dongyu Lang
Huawei Technologies Co., Ltd.
Wu Zhang
Huawei Technologies Co., Ltd.
Xintong Guo
Huawei Technologies Co., Ltd.
Huarui Xie
Huawei Technologies Co., Ltd.
Yinhao Ni
Huawei Technologies Co., Ltd.
Ping Liu
Assistant Professor, Krannert School of Management, Purdue University
Contract theory, Game theory, Macro finance, Real Options, Dynamic and Empirical corporate finance
Guang Shen
Huawei Technologies Co., Ltd.
Yi Bai
Huawei Technologies Co., Ltd.
Dandan Tu
Huawei Technologies Co., Ltd.
Changzheng Zhang
Huawei Technologies Co., Ltd.