The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses an overlooked vulnerability of large language models (LLMs): on imperfect text with disrupted word boundaries, performance degrades non-monotonically, a behavior that standard clean-text benchmarks do not expose. By systematically inserting spaces within words as a controlled perturbation, the authors identify and name the “Text Uncanny Valley”: a U-shaped performance curve in which moderate noise yields the worst results. They propose that word-level and character-level processing modes compete, and test this hypothesis through in-context learning probes, regularized perturbations, cross-task transfer, and tokenization entropy analysis. The findings indicate that the effect is weaker for stronger models and for tasks that depend less on exact lexical alignment, and that tokenization entropy peaks before the performance trough, which is consistent with the mode-conflict interpretation.

📝 Abstract

Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in the evaluation of imperfect text. In this work, we study how word-boundary corruption affects LLMs' ability to detect targeted information. When whitespace characters are inserted within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve as the insertion rate increases. We refer to this curve as the Text Uncanny Valley. To explain this observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
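
The word-boundary perturbation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `insert_spaces`, the `seed` parameter, and the choice to treat only boundaries between two alphanumeric characters as "within a word" are all assumptions made for illustration; the abstract specifies only that whitespace is inserted within words at a given insertion rate.

```python
import random


def insert_spaces(text: str, rate: float, seed: int = 0) -> str:
    """Insert a space at each within-word boundary with probability `rate`.

    Hypothetical helper: a "within-word boundary" is assumed here to be the
    gap between two adjacent alphanumeric characters.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Only split between two alphanumeric characters, never next to
        # existing whitespace or punctuation.
        if ch.isalnum() and nxt.isalnum() and rng.random() < rate:
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    sample = "needle in haystack"
    for r in (0.0, 0.3, 1.0):
        print(r, repr(insert_spaces(sample, r)))
```

At a rate of 0.0 the text is unchanged, and at 1.0 every word is split into single characters, so the original word boundaries can no longer be told apart from the inserted ones. Intermediate rates produce the partially fragmented text that, per the abstract, corresponds to the valley.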

Problem

Research questions and friction points this paper is trying to address.

Text Uncanny Valley

word-boundary corruption

LLM information retrieval

noisy text

performance degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Uncanny Valley

word-boundary corruption

mode transition hypothesis