Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a spurious correlation between syntactic structures and task domains that emerges during language model training, causing models to over-rely on superficial syntactic patterns rather than semantic content when executing instructions—thereby degrading task performance and undermining safety fine-tuning. We present the first systematic identification, quantification, and evaluation of this phenomenon, introducing a detection framework based on part-of-speech (POS) templates and synthetically generated data, applicable across diverse models including OLMo, Llama, and GPT-4o. Empirical results demonstrate that this spurious correlation significantly reduces accuracy on entity-knowledge tasks and can be exploited as a generalizable safety bypass: altering syntax alone—without changing semantics—suffices to evade refusal mechanisms. Our core contributions are the formal establishment of syntax–domain spurious correlation as a real, detectable phenomenon and the demonstration of its dual threat to both robustness and alignment safety.

📝 Abstract
For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates--frequent sequences of Part-of-Speech (PoS) tags--are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B, Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
Problem

Research questions and friction points this paper is trying to address.

Models learn spurious correlations between syntax and domain during training
Syntactic-domain correlations can override prompt semantics and lower performance
These unintended correlations can bypass safety measures in fine-tuned models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects spurious correlations between syntax and domain
Uses synthetic training dataset to evaluate model performance
Proposes framework to test for unintended syntactic associations
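The paper's detection framework is not reproduced here; as a rough illustration of the underlying idea, extracting PoS-tag templates and counting them per domain might look like the following sketch (toy hand-tagged data with illustrative Penn Treebank-style tags, stdlib only; function names and the corpus are hypothetical, not from the paper):

```python
from collections import Counter

def pos_templates(tagged_tokens, n=3):
    """Extract length-n PoS tag sequences (syntactic templates) from a tagged instruction."""
    tags = [tag for _, tag in tagged_tokens]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def domain_template_counts(corpus, n=3):
    """corpus: list of (domain, tagged_tokens) pairs. Returns a per-domain Counter of templates."""
    by_domain = {}
    for domain, tagged in corpus:
        by_domain.setdefault(domain, Counter()).update(pos_templates(tagged, n))
    return by_domain

# Toy PoS-tagged task instructions (tags are illustrative, not from a real tagger)
corpus = [
    ("geography", [("Name", "VB"), ("the", "DT"), ("capital", "NN"),
                   ("of", "IN"), ("France", "NNP")]),
    ("geography", [("Name", "VB"), ("the", "DT"), ("river", "NN"),
                   ("in", "IN"), ("Egypt", "NNP")]),
    ("science",   [("Explain", "VB"), ("how", "WRB"),
                   ("photosynthesis", "NN"), ("works", "VBZ")]),
]

counts = domain_template_counts(corpus)
# A template that dominates one domain but is absent elsewhere is a
# candidate spurious syntax-domain cue worth probing the model on.
print(counts["geography"].most_common(1))
```

In practice a real tagger (e.g., NLTK or spaCy) would supply the PoS tags, and the per-domain counts would feed a statistical test of syntax-domain association rather than raw frequencies.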