A Single Character Can Make or Break Your LLM Evals

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a previously overlooked critical issue in prompt engineering: the choice of delimiters (e.g., commas, newlines, hash symbols) used to separate in-context examples significantly impacts LLM evaluation outcomes—inducing up to ±23% performance variance on the MMLU benchmark and even reversing model rankings. This phenomenon is pervasive across diverse subject domains and model families (including Llama, Qwen, and Gemma), and persists regardless of model scale. Through systematic ablation experiments and attention-head analysis, the study elucidates how delimiter selection disrupts the model’s attentional focus on task-critical input tokens. To address this, the authors propose “explicit delimiter prompting”—a novel prompting strategy that explicitly instructs the model to treat delimiters as structural markers rather than semantic content—thereby substantially improving robustness. Empirical evaluation yields evidence-based best practices for delimiter selection, offering actionable guidance for reliable and reproducible in-context learning.
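The setup under study can be made concrete with a small sketch. The helper below joins the same in-context examples with different single-character delimiters; the example content and function name are illustrative assumptions, not the paper's actual benchmark prompts.

```python
# Three toy in-context examples; content is illustrative, not from the paper.
EXAMPLES = [
    "Q: What is 2+2? A: 4",
    "Q: What is the capital of France? A: Paris",
    "Q: What is H2O? A: Water",
]

def build_prompt(examples, delimiter):
    """Join in-context examples with a chosen delimiter string."""
    return delimiter.join(examples)

# The prompts below differ only in the separating character(s),
# yet the paper reports up to +/-23% MMLU variance from this choice.
for delim in (", ", "\n", "; ", "#"):
    prompt = build_prompt(EXAMPLES, delim)
    print(repr(delim), "->", len(prompt), "chars")
```

In an actual evaluation run, each of these prompt variants would be sent to the model under test, and accuracy compared across delimiter choices.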

📝 Abstract
Common Large Language Model (LLM) evaluations rely on demonstration examples to steer models' responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma? new line? semi-colon? hashtag? etc.? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU for example can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by only modifying the single character separating examples. We find LLMs' brittleness pervades topics, model families, and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find specifying the selected delimiter in the prompt boosts robustness and offer practical recommendations for the best-performing delimiters to select.
Problem

Research questions and friction points this paper is trying to address.

Minor delimiter changes drastically affect LLM evaluation performance
Model rankings can be manipulated through delimiter selection alone
LLM brittleness to delimiters persists across topics and does not improve with model scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studied delimiter choice impact on LLM evaluations
Probed attention mechanisms to identify effective separators
Proposed delimiter specification in prompts for robustness
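The last bullet's mitigation can be sketched as follows: prepend an instruction naming the chosen delimiter before the joined examples. The instruction wording and function name here are assumptions for illustration, not the authors' exact prompt.

```python
def build_robust_prompt(examples, delimiter, delimiter_name):
    """Join examples with the chosen delimiter, preceded by an
    instruction that names that delimiter explicitly (the paper finds
    this specification improves robustness)."""
    header = f"The in-context examples below are separated by a {delimiter_name}."
    return header + "\n\n" + delimiter.join(examples)

demo = build_robust_prompt(
    ["2+2 = 4", "3+3 = 6"], delimiter="#", delimiter_name="hash symbol"
)
print(demo)
```

The idea is that naming the delimiter up front encourages the model to treat it as a structural marker rather than as semantic content.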
Jingtong Su
FAIR at Meta, New York University
Jianyu Zhang
FAIR at Meta, New York University
Karen Ullrich
FAIR
Machine Learning
Léon Bottou
FAIR at Meta, New York University
Mark Ibrahim
Fundamental AI Research, Meta AI
Artificial Intelligence · Deep Learning · Generalization