SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code

📅 2025-09-29

🤖 AI Summary

Large language models (LLMs) frequently generate syntactically valid but semantically incorrect code, i.e., code that compiles yet behaves incorrectly. Existing post-hoc repair methods rely on program execution and test-based feedback, so they suffer from coarse-grained error localization, high latency, and incomplete test coverage. To address this, the authors propose SemGuard, the first framework enabling *line-level, real-time semantic supervision and correction during code generation*. SemGuard embeds a semantic evaluator in autoregressive decoding to detect semantically deviant lines on partial code, roll back to them, and guide regeneration, without requiring execution or test cases. To train the evaluator, they introduce SemDiff, the first dataset with precise line-level semantic annotations. Experiments show that SemGuard reduces the semantic error rate by 19.86% relative to ROCODE on SemDiff and improves Pass@1 on LiveCodeBench by 48.92% with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeek-Coder-6.7B on the Java benchmark SemDiff-Java, indicating that the approach generalizes across models and programming languages.

📝 Abstract

Large Language Models (LLMs) can translate natural language requirements into code, yet empirical analyses of representative models reveal that semantic errors (programs that compile but behave incorrectly) constitute the majority of observed faults (e.g., >60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, incurring latency, relying on incomplete test suites, and often mis-localizing the defect. Since semantic drift originates in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop error propagation. Constrained-decoding approaches such as ROCODE attempt this, but still wait until the entire program runs to obtain feedback and use entropy heuristics that do not truly capture semantics. A more effective solution must inject semantic signals, early and precisely, into the decoding process.

We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, we build SemDiff, the first dataset with fine-grained annotations that mark the exact line where a correct and an incorrect implementation diverge. The evaluator, once embedded in the LLM's decoder, flags deviations on partial code, rolls back to the faulty line, and guides regeneration, without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate by 19.86% on SemDiff relative to ROCODE, and lifts Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeek-Coder-6.7B on the Java benchmark SemDiff-Java, demonstrating model- and language-agnostic effectiveness.
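To make the idea of a line-level divergence annotation concrete, the sketch below finds the first line at which a correct and an incorrect implementation differ textually. It is only an illustration: the abstract does not describe how SemDiff annotations are actually produced, and the function name `first_divergent_line` and the choice to ignore trailing whitespace are assumptions made here.

```python
from typing import Optional


def first_divergent_line(correct: str, incorrect: str) -> Optional[int]:
    """Return the 1-based number of the first line where the programs differ.

    Hypothetical helper for illustration only; this is a plain textual
    comparison, not the annotation procedure used to build SemDiff.
    Returns None when the two programs are line-for-line identical.
    """
    # Ignore trailing whitespace so that formatting noise is not flagged.
    a = [line.rstrip() for line in correct.splitlines()]
    b = [line.rstrip() for line in incorrect.splitlines()]

    for number, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return number

    # One program is a strict prefix of the other: the divergence is the
    # first line that exists in only one of them.
    if len(a) != len(b):
        return min(len(a), len(b)) + 1
    return None


correct = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
incorrect = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs[1:]:\n"  # compiles, but silently skips the first element
    "        s += x\n"
    "    return s\n"
)

divergence = first_divergent_line(correct, incorrect)
```

Here both programs run without raising, but the second returns the wrong sum; the divergence falls on the loop header, which is the kind of line-level label a semantic evaluator could be trained on.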

Problem

Research questions and friction points this paper is trying to address.

Detecting semantic errors in LLM-generated code during decoding

Reducing latency by intervening before program execution

Providing real-time semantic feedback without requiring test cases

Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time semantic supervision during code generation

Rollback and regeneration at faulty lines without execution

Model-agnostic semantic evaluator trained with fine-grained divergence annotations
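The rollback-and-regenerate idea listed above can be sketched as a simple line-by-line loop. This is a minimal toy under stated assumptions, not SemGuard's implementation: the `generate_line` and `evaluator` callables, the `threshold`, and the `max_retries` budget are all hypothetical, and the sketch checks each candidate line immediately, so "rollback" reduces to discarding that candidate and sampling again.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate_line(accepted_lines, attempt) -> next line, or None when done
#   evaluator(partial_program_lines)       -> score in [0, 1], higher is better
LineGenerator = Callable[[List[str], int], Optional[str]]
Evaluator = Callable[[List[str]], float]


def guarded_generate(
    generate_line: LineGenerator,
    evaluator: Evaluator,
    threshold: float = 0.5,
    max_retries: int = 3,
    max_lines: int = 200,
) -> List[str]:
    """Generate a program line by line, regenerating lines the evaluator flags."""
    accepted: List[str] = []
    while len(accepted) < max_lines:
        chosen: Optional[str] = None
        best_score, best_line = -1.0, None
        finished = False
        for attempt in range(max_retries + 1):
            line = generate_line(accepted, attempt)
            if line is None:  # generator signals end of program
                finished = True
                break
            score = evaluator(accepted + [line])  # no execution, no tests
            if score > best_score:
                best_score, best_line = score, line
            if score >= threshold:
                chosen = line
                break
            # Otherwise: discard this candidate (rollback) and retry.
        if finished:
            break
        # If every retry was flagged, fall back to the best-scoring candidate.
        accepted.append(chosen if chosen is not None else best_line)
    return accepted


# Toy stand-ins so the loop can be exercised without a real model.
script = [
    "def total(xs):",
    "    s = 0",
    "    for x in xs:",
    "        s += x",
    "    return s",
]
buggy_line = "    for x in xs[1:]:"


def toy_generator(prefix: List[str], attempt: int) -> Optional[str]:
    i = len(prefix)
    if i >= len(script):
        return None
    # The first attempt at the loop header is semantically wrong.
    return buggy_line if (i == 2 and attempt == 0) else script[i]


def toy_evaluator(lines: List[str]) -> float:
    return 0.1 if "xs[1:]" in lines[-1] else 0.9


result = guarded_generate(toy_generator, toy_evaluator)
```

In this toy run the evaluator rejects the first candidate for the loop header and the second attempt is accepted, so `result` matches `script`. A real system would replace the two toy callables with an LLM decoder and a trained evaluator, and could roll back over more than one line.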
