Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Context-aware code completion models often lack robust syntactic awareness, posing a critical challenge for generating grammatically correct code. Method: This work introduces the first systematic evaluation of syntactic awareness in large language models (LLMs)—specifically Gemini and GPT series—using the SAFIM benchmark. We propose dynamic context window truncation and syntactic structure alignment, coupled with cosine similarity for semantic consistency assessment, and end-to-end measurement of latency and syntax error rate. Contribution/Results: (1) We establish the first evaluation framework for context-aware code completion explicitly focused on syntactic correctness; (2) we uncover an intrinsic accuracy–latency trade-off; (3) empirical results show GPT-4o achieves optimal balance between accuracy and response speed, whereas Gemini 1.5 Flash, though lowest in latency, incurs significantly higher syntax error rates. These findings provide empirical guidance for model selection and optimization in syntax-sensitive code generation tasks.
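The summary above uses cosine similarity to assess semantic consistency between a model's completion and the ground truth. As a minimal sketch, the comparison can be done over token-count vectors; note that whitespace tokenization here is a simplifying assumption standing in for whatever embedding or tokenization the authors actually used.

```python
import math
from collections import Counter

def cosine_similarity(completion: str, ground_truth: str) -> float:
    """Cosine similarity between token-count vectors of two strings.

    Simplified sketch: whitespace tokenization is an assumption, not
    necessarily the representation used in the paper.
    """
    a = Counter(completion.split())
    b = Counter(ground_truth.split())
    # Dot product over the shared vocabulary.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

An identical completion scores 1.0, while a completion sharing no tokens with the ground truth scores 0.0.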

📝 Abstract
The advent of Large Language Models (LLMs) has revolutionized code completion, transforming it into a more intelligent and context-aware feature in modern integrated development environments. These advancements have significantly enhanced developers' ability to write efficient and error-free code. This study evaluates the performance of several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo, using the Syntax-Aware Fill-in-the-Middle (SAFIM) dataset. This benchmark is specifically designed to assess models' capabilities in syntax-sensitive code generation. Performance metrics, such as cosine similarity with ground-truth completions and latency, were employed to measure both accuracy and efficiency. The findings reveal substantial differences in the models' code completion abilities, offering valuable insights into their respective strengths and weaknesses. This work provides a comparative analysis that underscores the trade-offs between accuracy and speed, establishing a benchmark for future advancements in LLM-based code completion.
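The abstract measures both latency and (per the summary) syntax error rate end to end. A minimal evaluation harness for these two metrics might look as follows; `complete(prefix, suffix)` is a hypothetical stand-in for an LLM API call, and checking syntax with `ast.parse` assumes Python-language targets, which is an illustrative choice rather than the paper's exact setup.

```python
import ast
import time

def evaluate(complete, samples):
    """Measure mean latency and syntax error rate for a completion function.

    `complete(prefix, suffix)` is a hypothetical completion callable;
    `samples` is an iterable of (prefix, suffix) pairs. Syntax validity
    is checked by parsing the reassembled file with ast.parse (assumes
    Python targets).
    """
    latencies, errors = [], 0
    for prefix, suffix in samples:
        start = time.perf_counter()
        middle = complete(prefix, suffix)
        latencies.append(time.perf_counter() - start)
        try:
            ast.parse(prefix + middle + suffix)
        except SyntaxError:
            errors += 1
    n = len(latencies)
    return {
        "mean_latency_s": sum(latencies) / n,
        "syntax_error_rate": errors / n,
    }
```

For example, a completion function that always returns a balanced expression yields a syntax error rate of 0.0, while one that emits an unclosed parenthesis is flagged on every sample.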
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs for code completion using the SAFIM dataset.
Measures model accuracy and efficiency on syntax-aware tasks.
Compares trade-offs between accuracy and speed across LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

The SAFIM framework assesses syntax-aware code generation.
Performance metrics include cosine similarity and latency.
Comparative analysis of LLMs for code completion.
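SAFIM poses fill-in-the-middle (FIM) tasks: given a code prefix and suffix, the model must produce the missing middle. Since chat-based models like GPT-4o have no native FIM tokens, the gap is typically marked inline in the prompt. The template below is an illustrative sketch, not the paper's exact prompt.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap a SAFIM-style (prefix, suffix) pair into a chat prompt.

    The <FILL_ME> marker and instruction wording are illustrative
    assumptions; the paper's actual prompt format is not shown here.
    """
    return (
        "Complete the code at <FILL_ME>. "
        "Return only the missing code.\n\n"
        f"{prefix}<FILL_ME>{suffix}"
    )
```

The model's reply is then spliced between the prefix and suffix before scoring, as in the evaluation metrics described above.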
👥 Authors

Hang Zhang
University of California San Diego, California, USA

Yanxin Shen
Unknown affiliation

Lun Wang
Google DeepMind
Research interests: LLM post-training, Multimodal LLM, LLM safety

Chuanqi Shi
University of California San Diego, California, USA

Shaoshuai Du
University of Amsterdam, Amsterdam, Netherlands

Yiyi Tao
Peking University
Research interests: Machine Learning, Artificial Intelligence, Trustworthy AI

Yixian Shen
University of Amsterdam
Research interests: Efficient DNN, Computer Architecture, System Optimization