Towards Understanding Bias in Synthetic Data for Evaluation

📅 2025-06-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper systematically shows that LLM-generated synthetic test collections introduce measurable bias into information retrieval (IR) evaluation: they distort absolute metrics (e.g., nDCG@10) yet preserve relative system rankings with notable robustness. To attribute the bias to its sources, the study uses linear mixed-effects modeling to disentangle three synthesis scenarios—query-only, label-only, and joint query-label generation—and demonstrates through large-scale empirical analysis that the bias is both measurable and modelable. The core contributions are: (1) establishing the validity boundary of synthetic test collections for relative IR evaluation; (2) proposing an interpretable, model-based framework for bias attribution; and (3) releasing all code, data, and synthetic test collections to support reproducible IR evaluation.

📝 Abstract
Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work (Rahmani et al., 2024) showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the bias present in evaluation results obtained using synthetic test collections can be significant for some purposes, e.g. computing absolute system performance, its effect may be much smaller when comparing relative system performance. Code and data are available at: https://github.com/rahmanidashti/BiasSyntheticData.
Problem

Research questions and friction points this paper is trying to address.

Investigates bias in synthetic test collections for IR evaluation
Examines reliability of LLM-generated queries and relevance judgments
Analyzes impact of bias on absolute vs relative system performance
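The absolute-versus-relative distinction above can be made concrete with a toy example. All scores and system names below are hypothetical, not the paper's data: synthetic labels shift every system's nDCG@10 downward (absolute bias), yet the ranking of systems, measured here with a minimal pure-Python Kendall's tau, is unchanged.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical nDCG@10 per system under human vs. synthetic judgments.
human     = {"BM25": 0.42, "monoT5": 0.61, "ColBERT": 0.58, "SPLADE": 0.55}
synthetic = {"BM25": 0.35, "monoT5": 0.52, "ColBERT": 0.49, "SPLADE": 0.47}

systems = list(human)
# Mean shift in absolute scores: the bias in absolute evaluation.
bias = sum(synthetic[s] - human[s] for s in systems) / len(systems)
# Rank agreement: how well relative comparisons are preserved.
tau = kendall_tau([human[s] for s in systems],
                  [synthetic[s] for s in systems])

print(f"mean absolute shift: {bias:+.3f}")
print(f"Kendall tau: {tau:.2f}")  # 1.00: ranking fully preserved
```

In this sketch every system loses roughly 0.08 nDCG under synthetic labels, but all pairwise orderings agree, so tau is 1.0: absolute numbers are biased while relative comparisons survive.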
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to generate synthetic test collections
Investigating biases in synthetic evaluation data
Validating bias effects with mixed-effects models
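The paper's mixed-effects analysis treats the synthesis scenario as a fixed effect and systems as random effects; a full fit would use something like statsmodels' `MixedLM`. As a dependency-free illustration, the simplified fixed-effect decomposition below (all scores and names hypothetical) estimates each scenario's effect as its mean deviation from the grand mean across systems.

```python
# Hypothetical per-(system, scenario) nDCG@10 scores; "human" is the
# fully human-annotated baseline, the others are synthesis scenarios.
scores = {
    ("BM25",    "human"): 0.42, ("BM25",    "syn_query"): 0.40,
    ("BM25",    "syn_label"): 0.36, ("BM25",    "syn_both"): 0.34,
    ("monoT5",  "human"): 0.61, ("monoT5",  "syn_query"): 0.59,
    ("monoT5",  "syn_label"): 0.54, ("monoT5",  "syn_both"): 0.52,
    ("ColBERT", "human"): 0.58, ("ColBERT", "syn_query"): 0.56,
    ("ColBERT", "syn_label"): 0.51, ("ColBERT", "syn_both"): 0.50,
}

systems   = sorted({s for s, _ in scores})
scenarios = sorted({c for _, c in scores})

grand = sum(scores.values()) / len(scores)

# Scenario effect = scenario mean over systems minus the grand mean.
# In a mixed-effects framing this corresponds to the fixed effect of
# the synthesis scenario, with per-system variation absorbed by
# random intercepts (ignored in this simplified sketch).
scenario_effect = {
    c: sum(scores[(s, c)] for s in systems) / len(systems) - grand
    for c in scenarios
}

for c, eff in sorted(scenario_effect.items(), key=lambda kv: kv[1]):
    print(f"{c:10s} effect: {eff:+.3f}")
```

With these made-up numbers, label synthesis depresses scores more than query synthesis, and joint synthesis more than either alone, which is the kind of per-scenario attribution the paper's model makes statistically rigorous.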