AI Summary
Existing code evaluation methods face three key bottlenecks: poor scalability of unit testing, syntactic similarity metrics (e.g., BLEU) that fail to capture functional correctness, and semantic metrics (e.g., CodeBERTScore) that rely on reference implementations. To address these, we propose MATCH, the first reference-free, task-description-oriented code evaluation framework based on contrastive learning. Its core innovation lies in constructing a cross-modal semantic space in which natural language task descriptions and generated code are jointly embedded and aligned via a multilingual code encoder. Experiments across Python, Java, and C++ benchmarks demonstrate that MATCH significantly improves correlation with functional correctness (+23.6% Pearson) and human judgments (+18.4% Spearman) over prior metrics, while maintaining efficiency, cross-language generalizability, and independence from reference code. MATCH establishes a new paradigm for automated, reference-free assessment of AI-generated code.
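At inference time, a metric of this kind reduces to embedding the task description and the candidate code with a shared encoder and scoring their similarity, with no reference solution involved. The sketch below illustrates that pipeline; the checkpoint (`microsoft/codebert-base`), mean pooling, and plain cosine similarity are illustrative assumptions, not MATCH's actual architecture.

```python
# Minimal sketch of a reference-free, embedding-based score in the spirit of MATCH.
# Assumptions (not from the paper): the encoder checkpoint, mean pooling, and
# plain cosine similarity stand in for whatever MATCH actually uses.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # placeholder NL-code encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def match_score(task_description: str, generated_code: str) -> float:
    """Cosine similarity between task and code embeddings; no reference code needed."""
    task_vec = embed(task_description)
    code_vec = embed(generated_code)
    return torch.nn.functional.cosine_similarity(task_vec, code_vec).item()

print(match_score(
    "Return the factorial of n.",
    "def f(n):\n    return 1 if n <= 1 else n * f(n - 1)",
))
```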
Abstract
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. Reference-free evaluation remains largely unaddressed, with few alternatives such as ICE-Score. To close this gap, this paper introduces MATCH, a novel reference-free metric. MATCH uses contrastive learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
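The contrastive-learning component the abstract refers to is typically realized with an InfoNCE-style objective over matched (description, code) pairs, treating other pairs in the batch as negatives. The sketch below shows that generic recipe; the in-batch negatives, temperature value, and symmetric loss are assumptions for illustration rather than details taken from the paper.

```python
# Sketch of an InfoNCE-style contrastive objective over (description, code) pairs.
# Matched pairs on the diagonal are positives; all other in-batch pairs are negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(desc_emb: torch.Tensor,
                     code_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """desc_emb, code_emb: (batch, dim) embeddings of aligned description/code pairs."""
    desc_emb = F.normalize(desc_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = desc_emb @ code_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: description-to-code and code-to-description directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```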