🤖 AI Summary
Current LLM-as-a-judge workflows suffer from a scarcity of diverse, representative evaluation data, which hinders robust, interpretable refinement of evaluation criteria. To address this, we propose a human-AI collaborative framework that integrates LLM-driven synthetic data generation into the criteria-refinement feedback loop. The tool supports on-demand construction of borderline cases, multi-dimensional test case configuration (domain, persona, length, desired outcome), and AI-assisted inline editing, and it exposes the prompts and explanations behind each generation, enabling efficient, transparent iteration on evaluation criteria. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases; the synthetic data matched hand-crafted data in both refinement efficacy and alignment with human preferences while substantially improving construction efficiency and scalability. Core contributions: (1) a closed loop between synthetic data generation and human judgment, and (2) an interpretability-focused design that makes criteria tuning traceable and explainable.
📝 Abstract
The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but its effectiveness is often limited by the scarcity of diverse, representative data for refining criteria. We present a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, empowering users to create tailored and challenging test cases with configurable domains, personas, lengths, and desired outcomes, including borderline cases. The tool also supports AI-assisted inline editing of existing test cases. To enhance transparency and interpretability, it reveals the prompts and explanations behind each generation. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, as it allowed them to rapidly generate diverse synthetic data without additional workload. The generated synthetic data proved as effective as hand-crafted data for both refining evaluation criteria and aligning with human preferences. These findings highlight synthetic data as a promising alternative, particularly in contexts where efficiency and scalability are critical.
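The configurable generation the abstract describes (domain, persona, length, desired outcome, including borderline cases) might be sketched as below. This is an illustrative assumption, not the tool's actual API: the names `TestCaseConfig` and `build_generation_prompt`, and the specific prompt wording, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TestCaseConfig:
    """Hypothetical knobs mirroring the tool's described configuration axes."""
    domain: str           # e.g. "customer support"
    persona: str          # e.g. "a frustrated user"
    length: str           # e.g. "short", "medium", "long"
    desired_outcome: str  # "pass", "fail", or "borderline"

def build_generation_prompt(config: TestCaseConfig, criterion: str) -> str:
    """Assemble an LLM prompt asking for one synthetic test case.

    The returned prompt could be sent to any LLM; revealing it to the
    user is what gives the workflow its transparency.
    """
    outcome_hint = {
        "pass": "clearly satisfies",
        "fail": "clearly violates",
        "borderline": "sits right at the boundary of",
    }[config.desired_outcome]
    return (
        f"Write a {config.length} response in the {config.domain} domain, "
        f"voiced as {config.persona}, that {outcome_hint} the following "
        f"evaluation criterion:\n{criterion}"
    )

# Example: request a borderline case for a politeness criterion.
prompt = build_generation_prompt(
    TestCaseConfig(domain="customer support", persona="a frustrated user",
                   length="short", desired_outcome="borderline"),
    criterion="The response must remain polite under provocation.",
)
print(prompt)
```

Mapping "borderline" to an explicit boundary instruction is one plausible way to elicit the challenging, near-threshold cases the paper highlights; the real tool may implement this differently.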