LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses inflated performance estimates in LLM reasoning evaluation caused by benchmark data exposure. To mitigate memorisation bias, the authors introduce a memorisation-resistant linguistic reasoning benchmark. Methodologically, they combine linguistically grounded templated question generation with orthographic obfuscation of writing systems, applying controlled, internally consistent symbol substitutions to dynamically generate semantically equivalent yet representationally novel question variants, thereby isolating the effect of training-data contamination. Empirical evaluation shows that leading models, including OpenAI o1-preview and DeepSeek R1, exhibit an average 18.7% accuracy drop on obfuscated text versus the original script, indicating that their reasoning capabilities are overestimated and depend substantially on surface-level textual representations. The framework offers a scalable, interpretable approach to memorisation-debiased reasoning assessment.

📝 Abstract
Effective evaluation of the reasoning capabilities of large language models (LLMs) is susceptible to overestimation due to data exposure of evaluation benchmarks. We introduce a framework for producing linguistic reasoning problems that reduces the effect of memorisation in model performance estimates and apply this framework to develop LINGOLY-TOO, a challenging evaluation benchmark for linguistic reasoning. By developing orthographic templates, we dynamically obfuscate the writing systems of real languages to generate numerous question variations. These variations preserve the reasoning steps required for each solution while reducing the likelihood of specific problem instances appearing in model training data. Our experiments demonstrate that frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with advanced reasoning. Our analysis also shows that LLMs exhibit noticeable variance in accuracy across permutations of the same problem, and on average perform better on questions appearing in their original orthography. Our findings highlight the opaque nature of response generation in LLMs and provide evidence that prior data exposure contributes to overestimating the reasoning capabilities of frontier models.
Problem

Research questions and friction points this paper is trying to address.

Reducing memorisation impact on LLM reasoning evaluation
Creating linguistic reasoning problems with orthographic obfuscation
Assessing LLM reasoning variance across problem permutations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linguistic templatisation reduces memorisation effects.
Orthographic obfuscation generates diverse question variations.
Framework challenges LLMs' reasoning beyond data exposure.
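The obfuscation idea described above can be illustrated with a minimal sketch: map each grapheme of a language's orthography to a substitute symbol via a seeded random permutation, applied consistently across a whole question variant. This is an illustrative toy, not the paper's actual pipeline; the function names and the use of Python's `random` module are assumptions.

```python
import random


def make_obfuscation_map(graphemes, substitutes, seed=0):
    """Build a one-to-one mapping from a language's graphemes to
    substitute symbols: a random permutation, fixed by the seed so the
    same substitution is applied consistently within one variant."""
    rng = random.Random(seed)
    shuffled = list(substitutes)
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))


def obfuscate(text, mapping):
    """Rewrite every mapped grapheme, leaving other characters intact,
    so the reasoning structure of the problem is preserved."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Toy example: permute a five-symbol orthography onto itself.
graphemes = ["a", "e", "i", "o", "u"]
mapping = make_obfuscation_map(graphemes, graphemes, seed=42)
print(obfuscate("aeiou aeiou", mapping))
```

Because the mapping is bijective and deterministic for a given seed, each variant is solvable by exactly the same reasoning steps as the original, while its surface form is unlikely to appear in training data.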
Jude Khouja
University of Oxford
Natural Language Processing, Machine Learning, Computational Social Science
Karolina Korgul
Oxford Internet Institute, University of Oxford
AI Safety, AI Agents, Evals
Simi Hellsten
United Kingdom Linguistics Olympiad, University of Glasgow, Glasgow, United Kingdom
Lingyi Yang
University of Oxford
Machine Learning, Time Series, Control
Vlad Neacsu
National University of Science and Technology POLITEHNICA Bucharest, Romania
Harry Mayne
University of Oxford, Oxford, United Kingdom
Ryan Kearns
University of Oxford, Oxford, United Kingdom
Andrew Bean
University of Oxford, Oxford, United Kingdom
Adam Mahdi
Associate Professor, University of Oxford
Large language models, multimodal AI, digital health