SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code

📅 2025-09-29

🤖 AI Summary
Large language models (LLMs) frequently generate syntactically valid but semantically incorrect code, i.e., code that compiles yet behaves incorrectly. Existing post-hoc repair methods rely on program execution and test-based feedback, and so suffer from coarse-grained error localization, high latency, and incomplete test coverage. To address this, we propose SemGuard, the first framework enabling *line-level, real-time semantic supervision and correction during code generation*. SemGuard integrates a semantic evaluator into autoregressive decoding, detecting semantically deviant lines as they are generated and rolling them back, without requiring execution or test cases. To train the evaluator, we introduce SemDiff, the first benchmark dataset with precise line-level semantic annotations. Experiments show that SemGuard reduces the semantic error rate by 19.86% relative to ROCODE on SemDiff and improves Pass@1 on LiveCodeBench by 48.92% with CodeLlama-7B. It also generalizes across LLMs and programming languages.

📝 Abstract
Large Language Models (LLMs) can translate natural language requirements into code, yet empirical analyses of representative models reveal that semantic errors (programs that compile but behave incorrectly) constitute the majority of observed faults (e.g., >60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, incurring latency, relying on incomplete test suites, and often mis-localizing the defect. Since semantic drift originates in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop error propagation. Constrained-decoding approaches such as ROCODE attempt this, but still wait until the entire program runs to obtain feedback and use entropy heuristics that do not truly capture semantics. A more effective solution must inject semantic signals, early and precisely, into the decoding process. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, we build SemDiff, the first dataset with fine-grained annotations that mark the exact line where a correct and an incorrect implementation diverge. The evaluator, once embedded in the LLM's decoder, flags deviations on partial code, rolls back to the faulty line, and guides regeneration, without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate by 19.86% on SemDiff relative to ROCODE, and lifts Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeekCoder-6.7B on the Java benchmark SemDiff-Java, demonstrating model- and language-agnostic effectiveness.
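
To make the idea of a divergence annotation concrete, here is a minimal Python sketch that finds the first line at which a correct and an incorrect implementation differ. It is only a textual comparison written for illustration: the function name `first_divergent_line` and the toy programs are assumptions, and the abstract does not describe how SemDiff annotations were actually produced.

```python
from itertools import zip_longest
from typing import Optional


def first_divergent_line(correct: str, incorrect: str) -> Optional[int]:
    """Return the 1-based line number where the two programs first differ.

    Returns None when the programs match line by line. Trailing whitespace
    is ignored. This is a purely textual comparison (an assumption made for
    illustration), not the SemDiff annotation procedure.
    """
    ref_lines = correct.splitlines()
    bad_lines = incorrect.splitlines()
    pairs = zip_longest(ref_lines, bad_lines)  # pads the shorter side with None
    for number, (ref, bad) in enumerate(pairs, start=1):
        if ref is None or bad is None or ref.rstrip() != bad.rstrip():
            return number
    return None


# Toy pair: the incorrect version uses integer division on the last line.
CORRECT = """def mean(xs):
    total = sum(xs)
    return total / len(xs)
"""

INCORRECT = """def mean(xs):
    total = sum(xs)
    return total // len(xs)
"""

print(first_divergent_line(CORRECT, INCORRECT))
```

Both toy programs compile, and they differ only in behavior on the final line, which is the kind of fault the abstract calls a semantic error.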
Problem

Research questions and friction points this paper is trying to address.

Detecting semantic errors in LLM-generated code during decoding
Reducing latency by intervening before program execution
Providing real-time semantic feedback without requiring test cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time semantic supervision during code generation
Rollback and regeneration at faulty lines without execution
Model-agnostic semantic evaluator trained with fine-grained divergence annotations
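
The rollback-and-regenerate idea listed above can be sketched as a line-level generation loop. This is a simplified illustration under stated assumptions, not SemGuard's implementation: `propose_line` stands in for the LLM, `is_on_track` stands in for the semantic evaluator, and the retry budget `max_retries` is a hypothetical parameter. The toy stubs exist only so the sketch runs without a model.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   propose_line(prefix, attempt) -> next line of code, or "" when finished
#   is_on_track(prefix, candidate) -> True if the evaluator accepts the line
ProposeLine = Callable[[List[str], int], str]
Evaluator = Callable[[List[str], str], bool]


def generate_with_rollback(
    propose_line: ProposeLine,
    is_on_track: Evaluator,
    max_retries: int = 3,
    max_lines: int = 50,
) -> List[str]:
    """Generate code line by line, discarding lines the evaluator rejects."""
    accepted: List[str] = []
    while len(accepted) < max_lines:
        for attempt in range(max_retries + 1):
            candidate = propose_line(accepted, attempt)
            if candidate == "":
                return accepted  # generator signals end of program
            if is_on_track(accepted, candidate):
                break  # evaluator accepts this line
            # Otherwise the line is rolled back and regenerated.
        # If every retry was rejected, keep the last candidate so the
        # loop still makes progress.
        accepted.append(candidate)
    return accepted


def toy_propose_line(prefix: List[str], attempt: int) -> str:
    """Stub generator: the first try at line 3 is wrong, the retry is right."""
    script = [
        ["def mean(xs):"],
        ["    total = sum(xs)"],
        ["    return total // len(xs)", "    return total / len(xs)"],
    ]
    if len(prefix) >= len(script):
        return ""
    options = script[len(prefix)]
    return options[min(attempt, len(options) - 1)]


def toy_is_on_track(prefix: List[str], candidate: str) -> bool:
    """Stub evaluator: rejects integer division, accepts everything else."""
    return "//" not in candidate


program = generate_with_rollback(toy_propose_line, toy_is_on_track)
print("\n".join(program))
```

No code is executed and no test case is consulted inside the loop; the only feedback signal is the evaluator's judgment on the partial program, which mirrors the property the paper emphasizes.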