When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

📅 2026-05-25

🤖 AI Summary

This study measures how semantic perturbations (e.g., paraphrasing, synonym substitution) and surface-level perturbations (e.g., formatting changes, reordering) of comparable severity differ in their effect on the answer consistency of large language model (LLM) agents. Across 68 cells covering Chain-of-Thought and ReAct agents, ten models from seven architecture families, and three tasks (GSM8K, MATH, HotpotQA), the authors apply severity-matched perturbations, bootstrap significance testing, cross-architecture generator swapping, and trajectory-level analysis. After severity matching, semantic perturbations raise answer inconsistency by an average of 19.69 percentage points relative to surface perturbations (p<0.0001), and a smaller but positive effect is replicated on a held-out model, qwen2.5-14B-Instruct. Several stress tests fail and are reported as such, including cluster-bootstrap significance under stricter assumptions and cross-architecture generator swaps. The work also proposes a “stealth-divergence” account, in which semantic perturbations often preserve the first action but cause intermediate reasoning to diverge at later steps. Code, the perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.
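The headline number is a per-cell difference in answer-inconsistency rates, averaged over cells and tested with a paired t-statistic. The sketch below illustrates that arithmetic under stated assumptions: the function names, the exact-match definition of an answer flip, and the toy rates are hypothetical and are not taken from the authors' released scripts.

```python
import math
from statistics import mean, stdev


def inconsistency_rate(original_answers, variant_answers):
    """Fraction of variants whose final answer differs from the original's.

    Assumption: answers are compared by exact string match.
    """
    flips = [o != v for o, v in zip(original_answers, variant_answers)]
    return sum(flips) / len(flips)


def paired_gap_t(semantic_rates, surface_rates):
    """Mean per-cell gap in percentage points, paired t-statistic, and
    the number of cells with a positive gap.

    Each list holds one inconsistency rate per cell, in the same cell order.
    """
    gaps = [100.0 * (s - p) for s, p in zip(semantic_rates, surface_rates)]
    n = len(gaps)
    mean_gap = mean(gaps)
    # Paired t: mean difference divided by its standard error.
    t_stat = mean_gap / (stdev(gaps) / math.sqrt(n))
    n_positive = sum(g > 0 for g in gaps)
    return mean_gap, t_stat, n_positive


# Toy data for four hypothetical cells (not the paper's measurements).
semantic = [0.30, 0.25, 0.40, 0.10]
surface = [0.10, 0.15, 0.20, 0.12]
mean_gap, t_stat, n_positive = paired_gap_t(semantic, surface)
print(f"gap = {mean_gap:+.2f} pp, t = {t_stat:.2f}, positive cells = {n_positive}/4")
```

With 68 cells, the same computation would use 67 degrees of freedom for the p-value; the p-value lookup is omitted here to keep the sketch dependency-free.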

📝 Abstract

We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($\kappa=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.
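The stealth-divergence picture describes trajectories whose first action matches the original but whose later steps differ, with slightly greater depth. A minimal sketch of how such a probe could be computed for one original/variant pair follows. The trajectory representation (a list of action strings), the function names, and the example actions are all assumptions for illustration and do not reflect the authors' actual probe definitions.

```python
def first_divergence_step(original_actions, variant_actions):
    """Return the 0-based index of the first step where the two
    trajectories differ, or None if they are identical.

    Assumption: each trajectory is a list of action strings, and a
    length mismatch counts as divergence at the shorter length.
    """
    for i, (a, b) in enumerate(zip(original_actions, variant_actions)):
        if a != b:
            return i
    if len(original_actions) != len(variant_actions):
        return min(len(original_actions), len(variant_actions))
    return None


def summarize_pair(original_actions, variant_actions):
    """Summarize one pair with the three signals named in the abstract:
    first-action preservation, later divergence, and depth change."""
    step = first_divergence_step(original_actions, variant_actions)
    return {
        "first_divergence_step": step,
        "first_action_preserved": step != 0,
        "diverges_later": step is not None and step > 0,
        "depth_delta": len(variant_actions) - len(original_actions),
    }


# Hypothetical ReAct-style traces, invented for this example.
original = ["search[A]", "lookup[B]", "finish[x]"]
variant = ["search[A]", "search[C]", "lookup[B]", "finish[y]"]
print(summarize_pair(original, variant))
```

Aggregating `first_action_preserved`, `diverges_later`, and `depth_delta` separately for semantic and surface variants would give the kind of contrast the abstract describes, though the paper's own matching and normalization choices may differ.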

Problem

Research questions and friction points this paper is trying to address.

semantic noise

surface noise

LLM agents

reasoning robustness

perturbation sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic noise

surface noise

stealth-divergence