EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing code generation benchmarks, which often conflate genuine reasoning by large language models with memorization of training data. To this end, the authors propose the first evaluation benchmark based on five esoteric programming languages—such as Brainfuck and Whitespace—that are unlikely to appear in standard training corpora. By incorporating documentation-based learning, interpreter feedback, and iterative experimentation, the benchmark emulates human-like learning and mitigates data contamination, thereby isolating transferable reasoning ability. Experimental results reveal a stark performance gap: while state-of-the-art models achieve high scores (85–95%) on conventional benchmarks, they attain only 0–11% on the new benchmark and consistently fail on tasks of moderate or higher difficulty. This highlights a critical deficiency in their generalization and reasoning capacity and offers a new paradigm for evaluating genuine model intelligence.

📝 Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.
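The abstract's claim that these languages "require the same computational primitives as mainstream programming" is easiest to see in Brainfuck, whose eight single-character commands manipulate a byte tape yet remain Turing-complete. As an illustrative sketch (not part of the paper's benchmark harness), a minimal Brainfuck interpreter fits in a few dozen lines of Python; the sample program and `run_bf` helper below are our own, not from EsoLang-Bench:

```python
def run_bf(code: str, tape_len: int = 30000) -> str:
    """Minimal Brainfuck interpreter (input command ',' omitted for brevity)."""
    tape = [0] * tape_len
    out = []
    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    ptr = pc = 0
    while pc < len(code):
        c = code[pc]
        if c == '>':   ptr += 1                      # move data pointer right
        elif c == '<': ptr -= 1                      # move data pointer left
        elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256  # increment cell (wraps)
        elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256  # decrement cell (wraps)
        elif c == '.': out.append(chr(tape[ptr]))    # output cell as ASCII
        elif c == '[' and tape[ptr] == 0:  pc = jumps[pc]  # skip loop if zero
        elif c == ']' and tape[ptr] != 0:  pc = jumps[pc]  # repeat loop if nonzero
        pc += 1
    return ''.join(out)

# 8 * 8 + 1 = 65, the ASCII code for 'A'
print(run_bf('++++++++[>++++++++<-]>+.'))  # prints "A"
```

Even printing one letter requires the model to plan loop-based arithmetic over raw memory cells, which hints at why few-shot prompting transfers so poorly to these languages.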
Problem

Research questions and friction points this paper is trying to address.

reasoning
code generation
benchmark
esoteric programming languages
memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

esoteric programming languages
genuine reasoning
benchmark contamination
transferable reasoning
language model evaluation