FullStack Bench: Evaluating LLMs as Full Stack Coders

📅 2024-11-30

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing code evaluation benchmarks are often confined to a single domain or programming language, which limits how well they can assess the full-stack programming capabilities of large language models. To address this, we propose FullStack Bench, a multi-domain, multi-language benchmark designed for full-stack programming that covers basic programming, data analysis, software engineering, mathematics, and machine learning. Its instructions and unit tests are written natively for each of 16 widely used programming languages to reflect real-world usage, rather than translated from a single source language. Complementing the benchmark, we introduce SandboxFusion, a sandboxed code execution tool that supports various programming languages and packages. Experimental results reveal performance gaps among code LLMs across full-stack tasks. Together, FullStack Bench and SandboxFusion enable efficient, fine-grained measurement of models' cross-domain and multilingual programming proficiency.

📝 Abstract

As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only a limited set of application domains. To address this gap, we have developed FullStack Bench, a comprehensive code evaluation dataset focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, FullStack Bench includes real-world instructions and corresponding unit test cases in 16 widely used programming languages, designed to reflect real-world usage scenarios rather than simple translations. We also release an effective code sandbox execution tool, SandboxFusion, which supports various programming languages and packages so that model performance on FullStack Bench can be evaluated efficiently. Comprehensive experimental results demonstrate the necessity and effectiveness of FullStack Bench and SandboxFusion.
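
To make the idea of unit-test-based evaluation concrete, here is a minimal sketch of how a generated solution might be checked against its test cases. It is not SandboxFusion's API: the function names `passes_unit_tests` and `pass_rate`, the timeout value, and the Python-only setup are all assumptions made for illustration. A plain subprocess is also not a real sandbox; it only stands in for one here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_unit_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Hypothetical helper: run candidate code followed by its unit tests.

    Returns True only if the combined script exits with status 0
    within the time limit. This is NOT an isolated sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution_test.py"
        script.write_text(candidate_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang or infinite loop as a failed sample.
            return False
        return result.returncode == 0


def pass_rate(samples: list[tuple[str, str]]) -> float:
    """Fraction of (candidate_code, test_code) pairs whose tests pass."""
    if not samples:
        return 0.0
    passed = sum(passes_unit_tests(code, tests) for code, tests in samples)
    return passed / len(samples)
```

A real multilingual harness would need a compiler or interpreter per language plus resource and filesystem isolation, which is the role the abstract assigns to SandboxFusion.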

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs across diverse full-stack coding domains

Assessing multilingual programming with real-world scenarios

Developing a sandbox tool for efficient code evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive full-stack programming evaluation dataset

Multilingual real-world instructions and test cases

Code sandbox execution tool for diverse languages

🔎 Similar Papers

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions