Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Context-aware code completion models often lack robust syntactic awareness, posing a critical challenge for generating grammatically correct code. Method: This work introduces the first systematic evaluation of syntactic awareness in large language models (LLMs)—specifically Gemini and GPT series—using the SAFIM benchmark. We propose dynamic context window truncation and syntactic structure alignment, coupled with cosine similarity for semantic consistency assessment, and end-to-end measurement of latency and syntax error rate. Contribution/Results: (1) We establish the first evaluation framework for context-aware code completion explicitly focused on syntactic correctness; (2) we uncover an intrinsic accuracy–latency trade-off; (3) empirical results show GPT-4o achieves optimal balance between accuracy and response speed, whereas Gemini 1.5 Flash, though lowest in latency, incurs significantly higher syntax error rates. These findings provide empirical guidance for model selection and optimization in syntax-sensitive code generation tasks.
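The summary above uses cosine similarity to assess semantic consistency between a model's completion and the ground truth. As a minimal sketch, the comparison can be done over token-count vectors; note that whitespace tokenization here is a simplifying assumption standing in for whatever embedding or tokenization the authors actually used.

```python
import math
from collections import Counter

def cosine_similarity(completion: str, ground_truth: str) -> float:
    """Cosine similarity between token-count vectors of two strings.

    Simplified sketch: whitespace tokenization is an assumption, not
    necessarily the representation used in the paper.
    """
    a = Counter(completion.split())
    b = Counter(ground_truth.split())
    # Dot product over the shared vocabulary.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

An identical completion scores 1.0, while a completion sharing no tokens with the ground truth scores 0.0.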

📝 Abstract
The advent of Large Language Models (LLMs) has revolutionized code completion, transforming it into a more intelligent and context-aware feature in modern integrated development environments. These advancements have significantly enhanced developers' ability to write efficient and error-free code. This study evaluates the performance of several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo, using the Syntax-Aware Fill-in-the-Middle (SAFIM) dataset. This benchmark is specifically designed to assess models' capabilities in syntax-sensitive code generation. Performance metrics, such as cosine similarity with ground-truth completions and latency, were employed to measure both accuracy and efficiency. The findings reveal substantial differences in the models' code completion abilities, offering valuable insights into their respective strengths and weaknesses. This work provides a comparative analysis that underscores the trade-offs between accuracy and speed, establishing a benchmark for future advancements in LLM-based code completion.
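The abstract measures both latency and (per the summary) syntax error rate end to end. A minimal evaluation harness for these two metrics might look as follows; `complete(prefix, suffix)` is a hypothetical stand-in for an LLM API call, and checking syntax with `ast.parse` assumes Python-language targets, which is an illustrative choice rather than the paper's exact setup.

```python
import ast
import time

def evaluate(complete, samples):
    """Measure mean latency and syntax error rate for a completion function.

    `complete(prefix, suffix)` is a hypothetical completion callable;
    `samples` is an iterable of (prefix, suffix) pairs. Syntax validity
    is checked by parsing the reassembled file with ast.parse (assumes
    Python targets).
    """
    latencies, errors = [], 0
    for prefix, suffix in samples:
        start = time.perf_counter()
        middle = complete(prefix, suffix)
        latencies.append(time.perf_counter() - start)
        try:
            ast.parse(prefix + middle + suffix)
        except SyntaxError:
            errors += 1
    n = len(latencies)
    return {
        "mean_latency_s": sum(latencies) / n,
        "syntax_error_rate": errors / n,
    }
```

For example, a completion function that always returns a balanced expression yields a syntax error rate of 0.0, while one that emits an unclosed parenthesis is flagged on every sample.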
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs for code completion using the SAFIM dataset.
Measures model accuracy and efficiency on syntax-aware tasks.
Compares trade-offs between accuracy and speed across LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

The SAFIM framework assesses syntax-aware code generation.
Performance metrics include cosine similarity and latency.
Comparative analysis of LLMs for code completion.
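SAFIM poses fill-in-the-middle (FIM) tasks: given a code prefix and suffix, the model must produce the missing middle. Since chat-based models like GPT-4o have no native FIM tokens, the gap is typically marked inline in the prompt. The template below is an illustrative sketch, not the paper's exact prompt.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap a SAFIM-style (prefix, suffix) pair into a chat prompt.

    The <FILL_ME> marker and instruction wording are illustrative
    assumptions; the paper's actual prompt format is not shown here.
    """
    return (
        "Complete the code at <FILL_ME>. "
        "Return only the missing code.\n\n"
        f"{prefix}<FILL_ME>{suffix}"
    )
```

The model's reply is then spliced between the prefix and suffix before scoring, as in the evaluation metrics described above.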
👥 Authors

Hang Zhang
University of California San Diego, California, USA

Yanxin Shen
Unknown affiliation

Lun Wang
Google DeepMind
Research interests: LLM post-training, Multimodal LLM, LLM safety

Chuanqi Shi
University of California San Diego, California, USA

Shaoshuai Du
University of Amsterdam, Amsterdam, Netherlands

Yiyi Tao
Peking University
Research interests: Machine Learning, Artificial Intelligence, Trustworthy AI

Yixian Shen
University of Amsterdam
Research interests: Efficient DNN, Computer Architecture, System Optimization