Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models can transfer algorithmic reasoning acquired during pretraining to unseen programming languages, and identifies an "implementation fidelity gap": models exhibit language-agnostic algorithmic understanding but struggle to implement the same algorithms correctly in a novel language. To evaluate this systematically, the authors introduce PyLang, a synthetic, minimal imperative language absent from pretraining corpora, and test frontier models zero-shot alongside fine-tuned Qwen3 models (4B/8B/32B) on 352 problems, with interventions including multi-task learning, preference tuning, code infilling, and latent-space objectives. Fine-tuning quickly teaches syntax, but Python still outperforms PyLang by up to 19% and no intervention closes the gap; an LLM judge finds that frontier models choose the same algorithm as in Python 80% of the time yet fail to turn it into a working PyLang implementation. CKA analysis shows that fine-tuned models reach nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage, pointing to a decoupling between semantic comprehension and syntactic realization.

📝 Abstract

Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate both zero-shot frontier models and fine-tuned Qwen3 models (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: performance in Python exceeds performance in PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select the same algorithm as in Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
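
To make the CKA comparison concrete, the following is a minimal, dependency-free sketch of linear centered kernel alignment between two sets of activations. It is an illustration under assumptions, not the paper's procedure: the abstract does not say which CKA variant, which layers, or which pooling the authors use, and the function names (`center_columns`, `cross_frobenius_sq`, `linear_cka`) and the toy matrices are hypothetical. Each row stands for one input (for example, one problem) and each column for one hidden feature.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    if len(x) != len(y):
        raise ValueError("both activation matrices need the same number of rows")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if denominator == 0.0:
        raise ValueError("CKA is undefined for constant activations")
    return numerator / denominator


# Toy activations (hypothetical values): 4 inputs, 2 features each.
python_acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
# A rotated copy stands in for "same information, different basis".
pylang_acts = [[-b, a] for a, b in python_acts]

print(linear_cka(python_acts, pylang_acts))  # close to 1.0: CKA ignores rotations
```

Linear CKA is invariant to rotation and uniform scaling of the feature space, which is why a score near 1 indicates that two representations encode the same structure even if individual neurons differ. A value above 0.97 across languages, as reported in the abstract, is therefore consistent with shared internal representations and a failure that arises later, at the output stage.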

Problem

Research questions and friction points this paper is trying to address.

code generation

language transfer

semantic competence

implementation fidelity gap

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

implementation fidelity gap

unseen programming language

zero-shot code generation