Improving the Reusability of Conversational Search Test Collections

๐Ÿ“… 2025-03-12
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Conversational search test collections suffer from incomplete human annotation (unjudged "holes"), leading to biased evaluation of novel systems and hindering fair cross-system comparison. Method: We first identify and characterize the phenomenon of test-collection reusability decay at the turn level; we then propose a hole-filling method based on a few-shot fine-tuned Llama 3.1 model that automatically labels relevant documents returned by new systems but missing from the human judgments. Contribution/Results: Evaluated on TREC iKAT 2023 and CAsT 2022, our approach achieves high inter-annotator agreement with human judgments (Cohen's κ > 0.8). Hole filling significantly improves ranking stability: the average Spearman rank correlation between regenerated and human-annotated evaluation pools increases by 23.6%. This work establishes a scalable, high-fidelity methodology for building sustainable, reusable conversational search benchmarks.
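The agreement and stability numbers above rest on two standard measures, Cohen's κ and Spearman's ρ. Below is a minimal sketch of how they can be computed with off-the-shelf libraries; the labels and system scores are made-up toy values, not the paper's data, and the 0-4 grading scale is an assumption.

```python
# Minimal sketch of the two evaluation measures referenced in the summary.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Hypothetical relevance labels for the same documents from a human assessor
# and from the LLM-based hole-filling model (assumed graded 0-4 scale).
human_labels = [0, 2, 4, 1, 0, 2, 3, 0]
llm_labels   = [0, 2, 4, 1, 1, 2, 3, 0]
print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))

# Hypothetical system scores under the original (incomplete) pool and under
# the pool with LLM-filled holes; Spearman's rho measures ranking stability.
scores_original = [0.41, 0.38, 0.35, 0.29, 0.22]
scores_filled   = [0.43, 0.40, 0.34, 0.31, 0.23]
rho, _ = spearmanr(scores_original, scores_filled)
print("Spearman's rho:", rho)
```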

๐Ÿ“ Abstract
Incomplete relevance judgments limit the reusability of test collections. When new systems are compared against the systems that contributed to the pool, they are often at a disadvantage because they return pockets of unjudged documents (called holes) in the test collection. The very nature of Conversational Search (CS) means that these holes are potentially larger and more problematic when evaluating systems. In this paper, we aim to extend CS test collections by employing Large Language Models (LLMs) to fill holes, leveraging existing judgments. We explore this problem using the TREC iKAT 2023 and TREC CAsT 2022 collections, where information needs are highly dynamic and responses are much more varied, leaving bigger holes to fill. Our experiments reveal that CS collections become less reusable in deeper turns. Fine-tuning the Llama 3.1 model leads to high agreement with human assessors, whereas few-shot prompting ChatGPT results in low agreement with humans; consequently, filling the holes of a new system with ChatGPT causes a larger shift in that system's ranking position, although regenerating the entire assessment pool with few-shot-prompted ChatGPT and using it for evaluation achieves a high rank correlation with human-assessed pools. We show that filling the holes with a few-shot-trained Llama 3.1 model enables a fairer comparison between a new system and the systems that contributed to the pool, and that this hole-filling model can improve the reusability of test collections.
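To make the hole-filling idea concrete, here is a minimal sketch of the workflow the abstract describes: for each document a new system returns that has no human judgment, an LLM assessor supplies a graded label, which is merged into the existing relevance judgments. The `llm_judge` callable and the data structures are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of filling judgment "holes" for a new system's run, assuming a
# generic LLM-based assessor (the paper fine-tunes Llama 3.1 for this role).
from typing import Callable, Dict, List, Tuple

Qrels = Dict[Tuple[str, str], int]  # (turn_id, doc_id) -> graded relevance

def fill_holes(run: Dict[str, List[str]], qrels: Qrels,
               llm_judge: Callable[[str, str], int],
               utterances: Dict[str, str],
               docs: Dict[str, str]) -> Qrels:
    """Label documents returned by a new system but missing from the pool."""
    filled = dict(qrels)
    for turn_id, ranked_docs in run.items():
        for doc_id in ranked_docs:
            if (turn_id, doc_id) not in filled:  # a "hole": unjudged document
                filled[(turn_id, doc_id)] = llm_judge(utterances[turn_id],
                                                      docs[doc_id])
    return filled
```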
Problem

Research questions and friction points this paper is trying to address.

Incomplete relevance judgments limit the reusability of test collections.
Can Large Language Models fill the judgment holes in conversational search collections?
Does fine-tuning Llama 3.1 enable fairer comparisons between new systems and pooled systems?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Large Language Models to fill judgment holes left by incomplete pooling.
Fine-tunes Llama 3.1 to reach high agreement with human assessors.
Regenerates the assessment pool with few-shot-prompted ChatGPT, achieving high rank correlation with human-assessed pools (see the prompt sketch below).
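As a rough illustration of the few-shot prompting variant mentioned above, the snippet below sketches what a relevance-judgment prompt template might look like. The wording, grading scale, and in-context example are assumptions for illustration, not the paper's actual prompt.

```python
# Hypothetical few-shot relevance-judgment prompt template (assumed format).
FEW_SHOT_PROMPT = """You are a relevance assessor for conversational search.
Grade how well the passage satisfies the user's information need on a 0-4 scale.

Example:
Need: {example_need}
Passage: {example_passage}
Grade: {example_grade}

Need: {need}
Passage: {passage}
Grade:"""

def build_prompt(need: str, passage: str) -> str:
    # Single in-context example for brevity; real use would include several.
    return FEW_SHOT_PROMPT.format(
        example_need="What are the health benefits of green tea?",
        example_passage="Green tea contains antioxidants that have been linked to ...",
        example_grade=2,
        need=need,
        passage=passage,
    )
```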
๐Ÿ”Ž Similar Papers
No similar papers found.