EvalAssist: A Human-Centered Tool for LLM-as-a-Judge

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost, inconsistent criteria, and low efficiency of evaluating LLM outputs, particularly when selecting among multiple models and prompts, this paper proposes a human-centered automated evaluation framework. Methodologically: (1) it provides an interactive, web-based environment for developing structured, shareable evaluation criteria; (2) it implements lightweight, portable LLM-as-a-judge pipelines using a prompt-chaining approach contributed to the open-source UNITXT library; (3) the quality-evaluation pipelines require no fine-tuning, relying on off-the-shelf LLMs and prompt engineering, while harms and risks are detected with specially trained evaluator models. Deployed internally, the system serves several hundred users, substantially reducing human evaluation effort and turnaround time while improving the consistency, reproducibility, and standardization of evaluations across diverse models and tasks.
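The prompt-chaining idea in the summary can be sketched as a two-step judge: the model first writes a free-text assessment, then a constrained verdict is extracted from that assessment. This is an illustrative sketch, not the actual UNITXT implementation; the `llm` callable and the Yes/No verdict format are assumptions for demonstration.

```python
from typing import Callable

def judge(llm: Callable[[str], str], criterion: str, output: str) -> dict:
    """Two-step prompt chain: free-text assessment first, then a
    constrained verdict conditioned only on that assessment."""
    # Step 1: ask for reasoning before any verdict.
    assessment = llm(
        f"Criterion: {criterion}\nOutput to evaluate: {output}\n"
        "Write a short assessment of how well the output meets the criterion."
    )
    # Step 2: extract a constrained verdict from the assessment.
    verdict = llm(
        f"Assessment: {assessment}\n"
        "Answer strictly 'Yes' or 'No': does the output satisfy the criterion?"
    )
    return {"assessment": assessment, "verdict": verdict.strip()}

# Stub model for demonstration; a real deployment would call an LLM API here.
def stub_llm(prompt: str) -> str:
    return "Yes" if "Answer strictly" in prompt else "The output is concise."

result = judge(stub_llm, "conciseness", "Paris is the capital of France.")
print(result["verdict"])
```

Separating assessment from verdict extraction is a common way to make judge outputs easier to parse and audit.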

📝 Abstract
With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model and prompt performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, assess harms and risks, or assist human evaluators with detailed assessments. We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment, where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. We support a set of LLM-based evaluation pipelines that leverage off-the-shelf LLMs and use a prompt-chaining approach we developed and contributed to the UNITXT open-source library. Our system also includes specially trained evaluators to detect harms and risks in LLM outputs. We have deployed the system internally in our organization with several hundred users.
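The "structured and portable format" for criteria could look like the following minimal sketch. The field names and the JSON serialization are hypothetical, chosen only to show how a criterion becomes a shareable, model-agnostic artifact; they are not EvalAssist's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CriterionOption:
    name: str         # verdict label the judge may return, e.g. "Yes"
    description: str  # what the label means, shown to the judge model

@dataclass
class Criterion:
    name: str
    description: str
    options: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to a shareable JSON document (asdict recurses into options).
        return json.dumps(asdict(self), indent=2)

conciseness = Criterion(
    name="conciseness",
    description="Is the response free of unnecessary repetition and filler?",
    options=[
        CriterionOption("Yes", "The response is concise."),
        CriterionOption("No", "The response contains redundant content."),
    ],
)
print(conciseness.to_json())
```

Keeping criteria as plain data rather than prompt text is what makes them testable and reusable across models.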
Problem

Research questions and friction points this paper is trying to address.

Simplifying evaluation of diverse LLM outputs for a given task
Reducing time and cost in LLM-as-a-judge workflows
Detecting harms and risks in LLM outputs effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive online criteria development environment
LLM-based evaluation pipelines with prompt-chaining
Specially trained evaluators for harm detection
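The bullets above can be combined into one pipeline: a dedicated harm detector screens each output, and only safe outputs proceed to the LLM-based quality judgment. This is a minimal sketch under stated assumptions; the stub detector and judge stand in for the specially trained evaluators and judge models the paper describes.

```python
from typing import Callable

def evaluate_with_safety(
    harm_detector: Callable[[str], bool],
    quality_judge: Callable[[str], str],
    output: str,
) -> dict:
    """Run the harm check first; only safe outputs reach the
    (typically more expensive) quality judgment."""
    if harm_detector(output):
        return {"safe": False, "verdict": "flagged"}
    return {"safe": True, "verdict": quality_judge(output)}

# Stubs standing in for a trained harm classifier and an LLM judge.
flag_words = {"attack"}
detector = lambda text: any(w in text.lower() for w in flag_words)
quality = lambda text: "Yes"

print(evaluate_with_safety(detector, quality, "A helpful answer."))
# → {'safe': True, 'verdict': 'Yes'}
```

Ordering the cheap safety screen before the judgment step keeps flagged outputs out of the quality pipeline entirely.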