What can Large Language Models Capture about Code Functional Equivalence?

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) capture the semantics underlying functional equivalence in code. To this end, the authors introduce SeqCoBench, a fine-grained, functionally grounded benchmark comprising over 20 Python code transformations that either preserve or break program semantics. They evaluate state-of-the-art (Code-)LLMs in zero-shot and parameter-efficient fine-tuning (PEFT) settings on the task of discriminating semantically equivalent from inequivalent program pairs. Results show that these LLMs barely outperform classical match-based retrieval scores, suggesting that they rely on shallow syntactic patterns rather than deeper behavioral semantics. The benchmark offers an empirically grounded standard for assessing semantic reasoning capabilities in code LLMs.
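To make the benchmark's core idea concrete, here is an illustrative sketch (not taken from SeqCoBench itself; the specific transformations are assumptions) of a semantics-preserving and a semantics-breaking rewrite of the same Python function, of the kind the paper asks models to tell apart:

```python
def original(xs):
    """Sum the even numbers in xs."""
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x
    return total

def preserved(xs):
    """Preserving rewrite: identical behavior via a comprehension."""
    return sum(x for x in xs if x % 2 == 0)

def broken(xs):
    """Breaking rewrite: the parity test is flipped, changing behavior."""
    return sum(x for x in xs if x % 2 == 1)

# A behavioral check on shared inputs distinguishes the two rewrites,
# even though both look syntactically close to the original.
inputs = [[], [1, 2, 3, 4], [5, 7], [0, -2, 9]]
assert all(original(i) == preserved(i) for i in inputs)
assert any(original(i) != broken(i) for i in inputs)
```

Judging equivalence from surface form alone is hard precisely because the breaking rewrite differs from the preserving one by a single token.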

📝 Abstract
Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using them to generate or classify code fragments. At the same time, understanding whether they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code)-LLMs, to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
Problem

Research questions and friction points this paper is trying to address.

Assessing Code-LLMs' semantic understanding
Evaluating functional equivalence in Python programs
Comparing LLMs and classical methods on code semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

SeqCoBench for functional equivalence
Zero-shot and finetuning evaluations
Code-LLMs vs classical methods
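The abstract's baseline of "classical match-based retrieval scores" is a purely syntactic signal. A minimal sketch of one such matcher (token-overlap Jaccard similarity, our assumption of the kind of surface match meant) shows why it fails on functional equivalence:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-split code tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def predict_equivalent(a: str, b: str, threshold: float = 0.5) -> bool:
    """Classify a program pair as equivalent from surface overlap alone."""
    return token_overlap(a, b) >= threshold

# Two one-line programs that differ in a single token but compute
# different functions: surface overlap is high, behavior is not shared.
p1 = "return sum(x for x in xs if x % 2 == 0)"
p2 = "return sum(x for x in xs if x % 2 == 1)"
print(round(token_overlap(p1, p2), 3))
```

Such a matcher labels the pair equivalent despite the behavioral difference, which is the failure mode the paper reports for both match-based scores and Code-LLMs.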
Authors
Nickil Maveli, School of Informatics, University of Edinburgh
Antonio Vergari, Reader (Associate Professor), University of Edinburgh, UK
Research interests: Artificial Intelligence, Probabilistic Machine Learning, Probabilistic Circuits, Neuro-Symbolic AI
Shay B. Cohen, School of Informatics, University of Edinburgh