A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates feedback-loop differences between internal and external human annotators on the complex task of creating multi-turn retrieval-augmented generation (RAG) conversations for evaluating LLMs. Using a longitudinal design that combines iterative annotation rounds with annotator experience surveys, the authors analyze trade-offs among conversation quality, quantity, and diversity across the two groups. Results show that the tight feedback loop of the internal group yields higher-quality conversations at a cost in quantity and diversity, while the looser loop of the external group yields more, and more varied, conversations of lower average quality. Based on these findings, the paper offers guidance on assigning differentiated roles to the two populations and modulating feedback frequency to balance quality against quantity and diversity, providing an empirically grounded, human-centered strategy for annotator allocation and workflow design in complex RAG data construction.
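
As a concrete illustration of the feedback-frequency modulation the study's guidance points toward, here is a minimal sketch assuming a batch-based workflow. The names `Track`, `review_every`, and `submit_batch` are hypothetical illustrations, not the authors' implementation.

```python
# A minimal sketch (not from the paper) of modulating reviewer-feedback
# frequency per annotator track: a tight loop for internal annotators,
# a loose loop for external annotators. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Track:
    name: str
    review_every: int          # trigger feedback after this many batches
    completed_batches: int = 0

    def submit_batch(self) -> bool:
        """Record a finished annotation batch; return True when a
        reviewer-feedback round is due for this track."""
        self.completed_batches += 1
        return self.completed_batches % self.review_every == 0

internal = Track("internal", review_every=1)  # feedback after every batch
external = Track("external", review_every=5)  # feedback after every 5th batch

for batch_id in range(1, 11):
    for track in (internal, external):
        if track.submit_batch():
            print(f"batch {batch_id}: schedule reviewer feedback for {track.name}")
```

In this sketch the internal track receives feedback after every batch (the tight loop the study associates with higher quality), while the external track accumulates five batches between reviews (the loose loop associated with higher quantity and diversity).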

📝 Abstract
Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and do not provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loops of internal and external human annotator groups on the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and report the results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops: a closer loop yields higher-quality conversations but decreases quantity and diversity. Further, we present guidance for how best to utilize the two population groups when performing annotation tasks, particularly when the task is complex.
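
The abstract does not name the diversity metrics used, but a distinct-n score (unique n-grams over total n-grams) is one common way such lexical diversity could be measured; the sketch below is an illustrative assumption, not the paper's method.

```python
# A minimal sketch of one way conversational diversity could be quantified
# (distinct-n). The metric choice is an assumption for illustration only.

from typing import Iterable

def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique across a set of texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical turns: a tight feedback loop that converges on similar
# phrasings would register as a lower distinct-2 score.
internal_turns = ["what does the passage say about X", "what does the passage say about Y"]
external_turns = ["summarize the key finding", "which author disputes this claim"]
print(distinct_n(internal_turns), distinct_n(external_turns))
```

Under a metric like this, the abstract's reported trade-off would appear as the internal group scoring higher on quality ratings but lower on distinct-n than the external group.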
Problem

Research questions and friction points this paper is trying to address.

How do the feedback loops of internal and external human annotator groups differ in practice?
How do these loops affect the quality, quantity, and diversity of multi-turn RAG conversations?
How should a complex annotation task be managed across different annotator populations?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A longitudinal, side-by-side comparison of internal and external human annotator feedback loops on the same complex task
Empirical analysis of the quality, quantity, and diversity trade-offs each loop produces, paired with an annotator experience survey
Practical guidance for allocating and combining the two annotator populations in complex annotation tasks