SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a failure mode of LLM code generation frameworks that validate code by mental simulation: the model hallucinates execution traces and confidently accepts buggy code. The authors call this the Mental-Reality Gap and split it into a Specification Gap (edge cases overlooked during planning) and a Verification Gap (correct behavior imagined for flawed code). They propose SolidCoder, built on the principle "don't imagine, execute." Its S.O.L.I.D. architecture forces edge-case identification before algorithm design and replaces imagined traces with sandboxed execution checked by property-based oracles. With GPT-4o, SolidCoder reaches pass@1 scores of 95.7% on HumanEval, 77.0% on CodeContests, and 26.7% on APPS, gains of 0.6, 4.3, and 3.4 percentage points respectively, and the improvements carry over to RL post-trained models.

📝 Abstract

State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.
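The two ideas in the abstract, listing edge cases before trusting a solution and checking real execution against a property-based oracle, can be illustrated with a minimal sketch. This is not the authors' released framework: the task (sorting a list), the helper names `run_candidate`, `sorted_permutation_oracle`, and `verify`, and the use of a fresh interpreter process with a timeout as a stand-in for a sandbox are all assumptions made for illustration.

```python
import json
import subprocess
import sys
from collections import Counter

# Hypothetical edge cases written down before any algorithm is chosen
# (empty input, single element, duplicates, negatives, reversed order).
EDGE_CASES = [[], [1], [2, 2, 1], [-3, 0, -3], [5, 4, 3, 2, 1]]


def run_candidate(source, func_name, arg, timeout=5.0):
    """Execute func_name(arg) from `source` in a separate interpreter process.

    Returns (True, value) on success or (False, error_text) on crash/timeout.
    A child process with a timeout is only a stand-in for a real sandbox.
    """
    harness = (
        source
        + "\nimport json, sys\n"
        + f"print(json.dumps({func_name}(json.loads(sys.argv[1]))))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness, json.dumps(arg)],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    # The harness prints the JSON result last, so read the final stdout line.
    return True, json.loads(proc.stdout.strip().splitlines()[-1])


def sorted_permutation_oracle(inp, out):
    """Property check: output is ordered and has the same elements as the input.

    No reference solution or expected output is needed, only properties.
    """
    return (
        isinstance(out, list)
        and all(a <= b for a, b in zip(out, out[1:]))
        and Counter(out) == Counter(inp)
    )


def verify(source, func_name, cases, oracle):
    """Run every case for real and collect (input, observed) pairs that fail."""
    failures = []
    for case in cases:
        ok, result = run_candidate(source, func_name, case)
        if not ok or not oracle(case, result):
            failures.append((case, result))
    return failures


# Two hypothetical candidates: the second silently drops duplicates,
# the kind of bug an imagined trace could easily wave through.
GOOD = "def sort_list(xs):\n    return sorted(xs)\n"
BUGGY = "def sort_list(xs):\n    return sorted(set(xs))\n"

if __name__ == "__main__":
    print(verify(GOOD, "sort_list", EDGE_CASES, sorted_permutation_oracle))
    print(verify(BUGGY, "sort_list", EDGE_CASES, sorted_permutation_oracle))
```

In this sketch the buggy candidate is rejected only on the inputs containing duplicates, which shows why the edge-case list and the executed check do different jobs: the list decides what gets exercised, and the oracle judges what actually happened rather than what a trace predicts.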

Problem

Research questions and friction points this paper is trying to address.

Mental-Reality Gap

code generation

hallucination

edge cases

execution verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mental-Reality Gap

SolidCoder

concrete execution