FullStack Bench: Evaluating LLMs as Full Stack Coders

📅 2024-11-30
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Existing code evaluation benchmarks are often confined to a single domain or programming language, limiting comprehensive assessment of large language models' full-stack programming capabilities. To address this, we propose FullStack Bench, the first multi-domain, multi-language benchmark specifically designed for full-stack programming, covering foundational programming, data science, software engineering, mathematics, and machine learning. Complementing the benchmark, we introduce SandboxFusion, a sandboxed execution framework that runs code in all 16 of the benchmark's programming languages. Our approach employs real-world development instructions and native, language-specific unit tests rather than translation-based cross-lingual evaluation. Experimental results reveal substantial performance gaps among state-of-the-art code LLMs across full-stack tasks. Together, FullStack Bench and SandboxFusion enable efficient, fair, and fine-grained quantification of models' cross-domain and cross-lingual programming proficiency.
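The evaluation loop implied by the summary, executing a model's completion together with the sample's native unit tests and counting a clean run as a pass, can be sketched in a few lines. This is a minimal illustration under assumed inputs, not the paper's released harness; the real benchmark dispatches each of its 16 languages to a matching runner, while this sketch handles only Python.

```python
import os
import subprocess
import sys
import tempfile

def run_with_unit_tests(completion: str, unit_tests: str, timeout_s: float = 10.0) -> bool:
    """Execute a completion plus its native unit tests in a subprocess;
    a zero exit code counts as a pass. Illustrative Python-only sketch."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,  # guard against infinite loops in generated code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Hypothetical sample: a completion paired with hand-written (not translated) tests.
completion = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(run_with_unit_tests(completion, tests))  # True if all asserts pass
```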

📝 Abstract
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only a limited set of application domains. To address this gap, we have developed FullStack Bench, a comprehensive code evaluation dataset focused on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, FullStack Bench provides real-world instructions and corresponding unit test cases in 16 widely-used programming languages, designed to reflect real usage scenarios rather than simple translations. Moreover, we release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate performance on FullStack Bench efficiently. Comprehensive experimental results demonstrate the necessity and effectiveness of FullStack Bench and SandboxFusion.
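The "fine-grained quantification" the abstract promises reduces, operationally, to aggregating per-sample pass/fail outcomes by domain and language. A minimal sketch of that aggregation follows; the record keys (`domain`, `language`, `passed`) are hypothetical, not the released dataset's field names.

```python
from collections import defaultdict

def pass_rates(results):
    """Aggregate boolean outcomes into per-(domain, language) pass rates.
    `results` is a list of dicts with hypothetical keys
    'domain', 'language', and 'passed'."""
    totals = defaultdict(lambda: [0, 0])  # (domain, language) -> [passed, attempted]
    for r in results:
        bucket = totals[(r["domain"], r["language"])]
        bucket[0] += r["passed"]  # True counts as 1
        bucket[1] += 1
    return {key: p / n for key, (p, n) in totals.items()}

results = [
    {"domain": "mathematics", "language": "python", "passed": True},
    {"domain": "mathematics", "language": "python", "passed": False},
    {"domain": "software engineering", "language": "go", "passed": True},
]
print(pass_rates(results))
# {('mathematics', 'python'): 0.5, ('software engineering', 'go'): 1.0}
```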
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs across diverse full-stack coding domains
Assessing multilingual programming with real-world scenarios
Developing a sandbox tool for efficient code evaluation (see the dispatch sketch after this list)
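The third friction point, executing code efficiently across many languages, comes down to dispatching each sample to a per-language runtime inside an isolated process. Below is a minimal sketch of that dispatch with a hypothetical runner table; the actual SandboxFusion configuration and API are not shown on this page, and the real tool provides far stronger isolation and package support.

```python
# Hypothetical language -> command templates; "{src}" is the source file path.
RUNNERS = {
    "python": ["python", "{src}"],
    "go": ["go", "run", "{src}"],
    "typescript": ["npx", "ts-node", "{src}"],
    "java": ["java", "{src}"],  # single-file source launch (Java 11+)
}

def command_for(language: str, src_path: str) -> list[str]:
    """Resolve the run command for a sample's language; unknown
    languages are rejected rather than guessed."""
    try:
        template = RUNNERS[language]
    except KeyError:
        raise ValueError(f"no runner configured for {language!r}")
    return [part.format(src=src_path) for part in template]

print(command_for("go", "solution.go"))  # ['go', 'run', 'solution.go']
```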
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive full-stack programming evaluation dataset
Multilingual real-world instructions and test cases (see the sample sketch after this list)
Code sandbox execution tool for diverse languages
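To make the second contribution concrete, here is one plausible shape for a benchmark sample, inferred from the abstract: each item pairs a real-world instruction with unit tests written natively in that item's own language rather than translated from another language's tests. The field names below are assumptions; the released dataset's schema may differ.

```python
# Hypothetical FullStack Bench sample record (field names are assumed).
sample = {
    "domain": "data analysis",  # one of the covered application domains
    "language": "typescript",   # one of the 16 widely-used languages
    "instruction": "Parse a CSV string and return the header row.",
    # Native TypeScript tests, not a translation of a Python test suite:
    "unit_tests": 'import assert from "assert";\n'
                  'assert.deepStrictEqual(parseHeader("a,b\\n1,2"), ["a", "b"]);',
}
print(sample["language"], "->", sample["domain"])
```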
👥 Authors
Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bowen Li, Bowen Li, Linyi Li, Boyi Liu, Jerry Liu, Kaibo Liu, Qi Liu, Shukai Liu, Si-Han Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan Sun, Yu Tao, Guoyin Wang, Siwei Wang, Xuwu Wang, Yite Wang, Zihan Wang, Jinxiang Xia, Liang Xiang, Xianzhong Xiao, Yongsheng Xiao, Chenguang Xi, Shulin Xin, Jingjing Xu, Shi-Bo Xu, Hongxia Yang, Jack Yang, Yingxiang Yang, Jian-Ming Yuan, Jun Zhang, Yufeng Zhang, Yuyu Zhang, Shen Zheng, He Zhu, Ming Zhu