AI Summary
Existing code evaluation methods face three key bottlenecks: poor scalability of unit testing, syntactic similarity metrics (e.g., BLEU) that fail to capture functional correctness, and semantic metrics (e.g., CodeBERTScore) that rely on reference implementations. To address these, we propose MATCH, the first reference-free, task-description-oriented code evaluation framework based on contrastive learning. Its core innovation lies in constructing a cross-modal semantic space in which natural language task descriptions and generated code are jointly embedded and aligned via a multilingual code encoder. Experiments across Python, Java, and C++ benchmarks demonstrate that MATCH significantly improves correlation with functional correctness (+23.6% Pearson) and human judgments (+18.4% Spearman) over prior metrics, while maintaining efficiency, cross-language generalizability, and independence from reference code. MATCH establishes a new paradigm for automated, reference-free assessment of AI-generated code.
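At inference time, a metric of this kind reduces to embedding the task description and the candidate code with a shared encoder and scoring their similarity, with no reference solution involved. The sketch below illustrates that pipeline; the checkpoint (`microsoft/codebert-base`), mean pooling, and plain cosine similarity are illustrative assumptions, not MATCH's actual architecture.

```python
# Minimal sketch of a reference-free, embedding-based score in the spirit of MATCH.
# Assumptions (not from the paper): the encoder checkpoint, mean pooling, and
# plain cosine similarity stand in for whatever MATCH actually uses.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # placeholder NL-code encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def match_score(task_description: str, generated_code: str) -> float:
    """Cosine similarity between task and code embeddings; no reference code needed."""
    task_vec = embed(task_description)
    code_vec = embed(generated_code)
    return torch.nn.functional.cosine_similarity(task_vec, code_vec).item()

print(match_score(
    "Return the factorial of n.",
    "def f(n):\n    return 1 if n <= 1 else n * f(n - 1)",
))
```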
Abstract
AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. Reference-free evaluation remains largely unaddressed, with few alternatives such as ICE-Score. To close this gap, this paper introduces MATCH, a novel reference-free metric. MATCH uses contrastive learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
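The contrastive-learning component the abstract refers to is typically realized with an InfoNCE-style objective over matched (description, code) pairs, treating other pairs in the batch as negatives. The sketch below shows that generic recipe; the in-batch negatives, temperature value, and symmetric loss are assumptions for illustration rather than details taken from the paper.

```python
# Sketch of an InfoNCE-style contrastive objective over (description, code) pairs.
# Matched pairs on the diagonal are positives; all other in-batch pairs are negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(desc_emb: torch.Tensor,
                     code_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """desc_emb, code_emb: (batch, dim) embeddings of aligned description/code pairs."""
    desc_emb = F.normalize(desc_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = desc_emb @ code_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: description-to-code and code-to-description directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```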