EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing code generation benchmarks, which often conflate genuine reasoning by large language models with memorization of training data. To this end, the authors propose the first evaluation benchmark based on five esoteric programming languages—such as Brainfuck and Whitespace—that are unlikely to appear in standard training corpora. By incorporating documentation-based learning, interpreter feedback, and iterative experimentation, the benchmark emulates human-like learning and mitigates data contamination, thereby isolating transferable reasoning ability. Experimental results reveal a stark performance gap: while state-of-the-art models achieve high scores (85–95%) on conventional benchmarks, they attain only 0–11% on the new benchmark and consistently fail on tasks of moderate or higher difficulty. This highlights a critical deficiency in their generalization and reasoning capacity and offers a new paradigm for evaluating genuine model intelligence.

📝 Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.
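The abstract's claim that these languages "require the same computational primitives as mainstream programming" is easiest to see in Brainfuck, whose eight single-character commands manipulate a byte tape yet remain Turing-complete. As an illustrative sketch (not part of the paper's benchmark harness), a minimal Brainfuck interpreter fits in a few dozen lines of Python; the sample program and `run_bf` helper below are our own, not from EsoLang-Bench:

```python
def run_bf(code: str, tape_len: int = 30000) -> str:
    """Minimal Brainfuck interpreter (input command ',' omitted for brevity)."""
    tape = [0] * tape_len
    out = []
    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    ptr = pc = 0
    while pc < len(code):
        c = code[pc]
        if c == '>':   ptr += 1                      # move data pointer right
        elif c == '<': ptr -= 1                      # move data pointer left
        elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256  # increment cell (wraps)
        elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256  # decrement cell (wraps)
        elif c == '.': out.append(chr(tape[ptr]))    # output cell as ASCII
        elif c == '[' and tape[ptr] == 0:  pc = jumps[pc]  # skip loop if zero
        elif c == ']' and tape[ptr] != 0:  pc = jumps[pc]  # repeat loop if nonzero
        pc += 1
    return ''.join(out)

# 8 * 8 + 1 = 65, the ASCII code for 'A'
print(run_bf('++++++++[>++++++++<-]>+.'))  # prints "A"
```

Even printing one letter requires the model to plan loop-based arithmetic over raw memory cells, which hints at why few-shot prompting transfers so poorly to these languages.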
Problem

Research questions and friction points this paper is trying to address.

reasoning
code generation
benchmark
esoteric programming languages
memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

esoteric programming languages
genuine reasoning
benchmark contamination
transferable reasoning
language model evaluation