🤖 AI Summary
This work investigates whether natural language inference (NLI) data generated by large language models (LLMs), specifically GPT-4, Llama-2 70b Chat, and Mistral 7b Instruct, inherits the annotation artifacts and societal biases (e.g., gender, race, and age stereotypes) found in human-annotated NLI datasets. We construct LLM-generated NLI subsets and empirically identify severe hypothesis-only artifacts and stereotypical social biases, the first such evidence in synthetic NLI data. To detect these biases systematically, we propose a dual-path framework: (1) a fine-tuned BERT hypothesis-only classifier, and (2) pointwise mutual information (PMI) analysis of bias-associated lexical patterns. Experiments show the framework achieves 86-96% classification accuracy on LLM-generated data, substantially higher than its accuracy on human-annotated data, and quantitatively identifies multiple bias-correlated lexical terms. Our findings provide both a novel methodology and critical empirical evidence for bias assessment and mitigation in LLM-synthesized training data.
📝 Abstract
We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases, as NLP datasets elicited from crowdsourced workers do. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use pointwise mutual information to identify the words in each dataset that are associated with gender-, race-, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve between 86% and 96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
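The hypothesis-only probe works because an NLI label should be unrecoverable from the hypothesis alone: a model that never sees the premise should perform near chance (~33% for three classes), so above-chance accuracy signals annotation artifacts in the hypotheses. The paper fine-tunes BERT; the sketch below substitutes a tiny stdlib Naive Bayes to show the same idea, with invented toy data that mimics one known artifact (negation words in contradiction hypotheses). Nothing here is the paper's actual data or model.

```python
import math
from collections import Counter

class HypothesisOnlyNB:
    """Multinomial Naive Bayes trained on hypotheses alone (premises discarded).
    Above-chance accuracy on held-out hypotheses signals annotation artifacts."""

    def fit(self, hypotheses, labels):
        self.prior = Counter(labels)                      # label frequencies
        self.counts = {l: Counter() for l in self.prior}  # word counts per label
        for hyp, label in zip(hypotheses, labels):
            self.counts[label].update(hyp.split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, hypothesis):
        n = sum(self.prior.values())
        def log_score(label):
            total = sum(self.counts[label].values()) + len(self.vocab)
            s = math.log(self.prior[label] / n)
            for w in hypothesis.split():                  # add-one smoothing
                s += math.log((self.counts[label][w] + 1) / total)
            return s
        return max(self.prior, key=log_score)

# Toy training set mimicking known NLI artifacts: negation words mark
# contradictions, generic statements mark entailments (illustrative only).
hyps = ["someone is outdoors", "a person is moving",
        "nobody is outside", "the person is not moving",
        "the tall person is outdoors", "someone is winning the race"]
labels = ["entailment", "entailment", "contradiction", "contradiction",
          "neutral", "neutral"]
clf = HypothesisOnlyNB().fit(hyps, labels)
pred = clf.predict("nobody is moving")  # label guessed without any premise
```

If such a classifier generalizes well above chance, the label is leaking through lexical cues in the hypothesis, which is exactly the artifact the paper measures in LLM-elicited data.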
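The PMI analysis scores how strongly a word co-occurs with a demographic term, PMI(w, t) = log2 p(w, t) / (p(w) p(t)), with high scores flagging stereotype-correlated vocabulary. A minimal sketch using sentence-level co-occurrence counts; the hypotheses and terms here are invented for illustration, while the paper applies this to LLM-generated NLI data with gender-, race-, and age-related word lists:

```python
import math

def pmi(word, term, sentences):
    """PMI(word, term) = log2( p(word, term) / (p(word) * p(term)) ),
    with probabilities estimated from sentence-level co-occurrence."""
    n = len(sentences)
    p_w = sum(word in s for s in sentences) / n
    p_t = sum(term in s for s in sentences) / n
    p_wt = sum(word in s and term in s for s in sentences) / n
    if p_wt == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(p_wt / (p_w * p_t))

# Invented toy hypotheses, tokenized into word lists.
hypotheses = [
    "the woman is cooking dinner".split(),
    "the woman is cooking at home".split(),
    "a man is driving a truck".split(),
    "a man is reading".split(),
]
score = pmi("cooking", "woman", hypotheses)  # 1.0: "cooking" skews female here
```

A positive PMI means the word appears with the demographic term more often than chance would predict; ranking words by PMI against each term surfaces the bias-correlated vocabulary the paper reports.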