Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit fragile generalization under minor input perturbations—such as stem formatting changes or distractor lengthening—where high benchmark scores mask overreliance on superficial cues. Method: We propose “generalization stress testing,” a novel evaluation paradigm that systematically quantifies performance degradation under controlled perturbations to format, vocabulary, and irrelevant content. Using standard benchmarks (e.g., MMLU), we conduct rigorous ablation experiments across diverse models—including Qwen 2.5-1.5B and GPT-4—to isolate bias sources. Contribution/Results: Results reveal severe generalization deficits: Qwen 2.5-1.5B suffers a 53-point MMLU drop when distractors are lengthened; GPT-4 incurs a 25-point accuracy decline under question-type transformations—demonstrating the ubiquity of such vulnerabilities. This work challenges the validity of conventional evaluations and establishes a methodological foundation for robustness assessment and model improvement.
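The distractor-lengthening perturbation described above can be sketched as a simple content-preserving transform on a multiple-choice item. This is an illustrative assumption of how such a perturbation might be implemented, not the paper's actual harness; the function name, item structure, and padding phrase are all hypothetical.

```python
def lengthen_distractors(question: str, options: dict, answer_key: str,
                         padding: str = ", which is a commonly discussed point") -> dict:
    """Pad every incorrect option with semantically empty filler,
    leaving the question and the correct answer untouched."""
    perturbed = {
        key: (text if key == answer_key else text + padding)
        for key, text in options.items()
    }
    return {"question": question, "options": perturbed, "answer": answer_key}

# Toy example: only the distractors B, C, D grow longer.
item = lengthen_distractors(
    "What is the capital of France?",
    {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Lille"},
    answer_key="A",
)
```

Because the padding adds no information, a model with robust abstract representations should answer the perturbed item identically; the reported 53-point drop suggests the models instead key on superficial cues such as option length.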

📝 Abstract
This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs, specifically focusing on minor perturbations in well-established benchmarks (e.g., slight changes in question format or distractor length). Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 or drops from 89 to 36 depending on how option lengths are changed, without altering the question. Even GPT-4 experiences a 25-point accuracy loss when question types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts. This work aligns with the ACL 2025 theme track on the Generalization of NLP models, proposing a "Generalization Stress Test" to assess performance shifts under controlled perturbations. The study calls for reevaluating benchmarks and developing more reliable evaluation methodologies to better capture LLM generalization abilities.
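The "performance shifts under controlled perturbations" that the stress test reports can be computed as the accuracy gap between the original and perturbed versions of a benchmark. The sketch below is a minimal illustration under assumed prediction/gold list structures, not the paper's evaluation code.

```python
def accuracy(predictions: list, gold: list) -> float:
    """Fraction of items answered correctly."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def stress_test_drop(baseline_preds: list, perturbed_preds: list, gold: list) -> float:
    """Accuracy degradation, in points, under a perturbation.

    Both prediction lists are scored against the same gold answers,
    since the perturbations are content-preserving.
    """
    return 100 * (accuracy(baseline_preds, gold) - accuracy(perturbed_preds, gold))

# Toy illustration: 4/5 correct before the perturbation, 2/5 after.
gold = ["A", "B", "C", "D", "A"]
drop = stress_test_drop(["A", "B", "C", "D", "B"],
                        ["A", "C", "C", "A", "B"], gold)  # 40.0 points
```

A robust model would show a drop near zero across format, vocabulary, and irrelevant-content perturbations; the large gaps reported here (53 points for Qwen 2.5-1.5B, 25 for GPT-4) are what motivate the call to reevaluate benchmarks.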
Problem

Research questions and friction points this paper is trying to address.

LLM fragility in generalization to novel inputs
Impact of minor, content-preserving input modifications on benchmark scores
Need for more robust evaluation methodologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic quantification of LLM fragility under minor benchmark perturbations
Controlled perturbations to format, vocabulary, and irrelevant content
Proposal of the "Generalization Stress Test" evaluation paradigm