Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limited robustness and generalization of large language models (LLMs) in code generation. The authors propose the first systematic integration of ensemble learning into this task: a consensus-driven voting framework that requires no fine-tuning or reinforcement learning. The method generates candidate programs in parallel across multiple LLMs and applies a weighted, structured voting mechanism grounded in complementary similarity measures: syntactic and semantic similarity (via CodeBLEU) and behavioral equivalence (via CrossHair's differential behavior analysis). Evaluated on HumanEval and LiveCodeBench, the approach achieves 90.2% (+6.7 percentage points over GPT-4o) and 50.2% (+6.8 points), respectively. Notably, even with ensembles of entirely open-source models, it attains 80.5% on HumanEval and 41.6% on LiveCodeBench, demonstrating strong efficacy and broad applicability in resource-constrained settings.
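
As an illustration, here is a minimal sketch of the consensus-driven selection step, with `similarity` standing in for the paper's weighted combination of CodeBLEU and CrossHair scores (the function name and averaging scheme are placeholders, not the authors' exact implementation):

```python
# Minimal sketch of consensus-driven candidate selection (illustrative only).
# `similarity` stands in for the paper's weighted combination of CodeBLEU
# (syntactic/semantic) and CrossHair (behavioral) similarity scores.
from typing import Callable, List

def select_by_consensus(
    candidates: List[str],
    similarity: Callable[[str, str], float],
) -> str:
    """Return the candidate most similar, on average, to all the others."""
    def consensus_score(i: int) -> float:
        others = [similarity(candidates[i], candidates[j])
                  for j in range(len(candidates)) if j != i]
        return sum(others) / len(others)

    best = max(range(len(candidates)), key=consensus_score)
    return candidates[best]
```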

📝 Abstract
Ensemble learning has been widely used in machine learning to improve model robustness, accuracy, and generalization, but has not yet been applied to code generation tasks with large language models (LLMs). We propose an ensemble approach for LLMs in code generation. Instead of relying on the output of a single model, we generate multiple candidate programs from different LLMs and apply a structured voting mechanism to select the most reliable solution. For voting, we compute syntactic and semantic similarity using CodeBLEU and behavioral equivalence using CrossHair's differential behavior analysis. By aggregating these similarity scores, we select the program that best aligns with the consensus among the candidates. We show through experiments that our ensemble approach consistently outperforms standalone LLMs on the well-known HumanEval and the more challenging LiveCodeBench datasets, achieving an accuracy of 90.2% and 50.2%, respectively, on the two datasets. In comparison, the best-performing LLM (GPT-4o) has an accuracy of 83.5% and 43.4%, respectively. Furthermore, even when restricted to free open-source models, our method achieves an accuracy of 80.5% and 41.6%, respectively, demonstrating the viability of our approach in resource-constrained settings.
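
For the syntactic and semantic component, a hedged sketch using the open-source `codebleu` package (the package and its `calc_codebleu` helper are an assumption about the tooling; the abstract only names the CodeBLEU metric itself):

```python
from codebleu import calc_codebleu  # assumes `pip install codebleu`

def codebleu_similarity(candidate: str, reference: str) -> float:
    # CodeBLEU combines n-gram, weighted n-gram, AST (syntax), and
    # data-flow matching, which is how the paper captures both
    # syntactic and semantic similarity between candidate programs.
    scores = calc_codebleu([reference], [candidate], lang="python")
    return scores["codebleu"]
```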
Problem

Research questions and friction points this paper is trying to address.

LLM code generation lacks robustness and generalization when relying on a single model's output.
Ensemble learning, despite its success across machine learning, had not yet been applied to LLM code generation.
Selecting the most reliable program among divergent candidates requires comparing them syntactically, semantically, and behaviorally.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble learning applied to LLM code generation without fine-tuning or reinforcement learning
Structured, similarity-weighted voting built on CodeBLEU and CrossHair (see the sketch below)
Consistent accuracy gains on HumanEval and LiveCodeBench, including with open-source-only ensembles
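
Below is a sketch of the behavioral-equivalence check via CrossHair's `diffbehavior` subcommand; the dotted function paths are hypothetical, and treating a "No differences found" message as equivalence is an assumption about the CLI's output rather than the authors' exact integration:

```python
import subprocess

def behaviorally_equivalent(fn_path_a: str, fn_path_b: str) -> bool:
    # `crosshair diffbehavior` searches for inputs on which two functions
    # disagree, e.g. fn_path_a = "candidate_a.solve" (hypothetical module
    # layout). An empty counterexample report is read as equivalence here.
    proc = subprocess.run(
        ["crosshair", "diffbehavior", fn_path_a, fn_path_b],
        capture_output=True,
        text=True,
    )
    return "No differences found" in proc.stdout
```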