AI Summary
Background: Existing LLM code evaluation benchmarks rely on static datasets, lack memory-safety validation, and suffer from poor scalability. Method: We propose the first dynamic benchmark generation framework tailored for programming LLMs, automatically extracting function-level tasks from real-world open-source projects. Evaluation employs a tripartite metric: compilation success rate, functional correctness, and memory safety, verified via AddressSanitizer (ASan) and UndefinedBehaviorSanitizer (UBSan) diagnostics. Our approach integrates program analysis, multi-language AST parsing, compiler toolchains (Clang/GCC), and runtime memory-safety instrumentation to mitigate overfitting and enable scalable ground-truth construction. Contribution/Results: Evaluating 17 state-of-the-art code-generation LLMs on the PHP and SQL subsets reveals systematic deficiencies in memory-safety capabilities; model scaling yields negligible improvement in this dimension. This work establishes a new paradigm for trustworthy code-generation evaluation.
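As a rough illustration of how the tripartite metric could be computed, the sketch below builds one rewritten translation unit with Clang's ASan/UBSan instrumentation and buckets the outcome into the three categories. The file name, test command, and result labels are hypothetical; the paper's actual harness rebuilds entire OSS projects and runs their full test suites rather than single files.

```python
# Illustrative sketch (not the authors' harness): classify one LLM-rewritten
# source file into the three metrics -- compilability, functional correctness,
# and memory safety -- using Clang with ASan/UBSan instrumentation.
import subprocess

def evaluate_candidate(source_file: str, test_cmd: list[str]) -> str:
    # Metric 1: compilability -- build with both sanitizers enabled.
    build = subprocess.run(
        ["clang", "-g", "-fsanitize=address,undefined", source_file, "-o", "candidate_bin"],
        capture_output=True, text=True,
    )
    if build.returncode != 0:
        return "compile_error"

    # Run the test suite against the instrumented binary.
    run = subprocess.run(test_cmd, capture_output=True, text=True)

    # Metric 3: memory safety -- sanitizer diagnostics signal a violation
    # even when the test would otherwise pass.
    if "AddressSanitizer" in run.stderr or "runtime error:" in run.stderr:
        return "memory_unsafe"

    # Metric 2: functional correctness -- a nonzero exit means test failures.
    return "pass" if run.returncode == 0 else "test_failure"
```

ASan and UBSan compose in a single instrumented build, so one test run can yield both the correctness and the memory-safety signal.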
Abstract
As AI coding assistants are rapidly adopted and LLM-assisted development becomes increasingly prevalent, there is an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual effort to create static datasets, rely on indirect or insufficiently challenging tasks, depend on non-scalable ground truth, or neglect critical low-level security evaluations, particularly memory-safety issues. In this work, we introduce OSS-Bench, a benchmark generator that automatically constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety, leveraging robust signals such as compilation failures, test-suite violations, and sanitizer alerts as ground truth. In our evaluation, the benchmark, instantiated as OSS-Bench(php) and OSS-Bench(sql), profiles 17 diverse LLMs, revealing insights such as intra-family behavioral patterns and inconsistencies between model size and performance. Our results demonstrate that OSS-Bench mitigates overfitting by leveraging the evolving complexity of OSS and highlights LLMs' limited understanding of low-level code security via extended fuzzing experiments. Overall, OSS-Bench offers a practical and scalable framework for benchmarking the real-world coding capabilities of LLMs.
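To make the "replaces functions with LLM-generated code" step concrete, the following is a minimal sketch that splices a generated definition over the original one in a C source file before the project is rebuilt. It assumes libclang's Python bindings purely for illustration; the helper name and its arguments are hypothetical, and the abstract does not specify which parser OSS-Bench uses.

```python
# Hypothetical sketch: overwrite one function definition in an OSS source file
# with an LLM-generated replacement, using libclang's Python bindings
# ("pip install libclang"). Not the authors' implementation.
import clang.cindex as ci

def replace_function(path: str, func_name: str, llm_definition: str) -> bytes:
    """Return the file contents with func_name's definition replaced."""
    tu = ci.Index.create().parse(path, args=["-std=c11"])
    with open(path, "rb") as f:
        src = f.read()  # libclang extents are byte offsets into the raw file

    for cur in tu.cursor.walk_preorder():
        if (cur.kind == ci.CursorKind.FUNCTION_DECL
                and cur.spelling == func_name
                and cur.is_definition()
                and cur.location.file and cur.location.file.name == path):
            start, end = cur.extent.start.offset, cur.extent.end.offset
            return src[:start] + llm_definition.encode() + src[end:]

    raise ValueError(f"no definition of {func_name} found in {path}")
```

The rewritten file would then be dropped back into the project tree and fed to the compile/test/sanitizer pipeline sketched above.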