TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 3
Influential: 2
📄 PDF
🤖 AI Summary
To address the challenges of evaluating LLM-generated content in enterprise settings—namely, low assessment reliability, high manual annotation costs, and difficulty modeling domain-specific, context-sensitive criteria—this paper proposes a fine-grained, automated evaluation framework based on in-context learning (ICL). The method decomposes holistic evaluation criteria into interpretable, domain-grounded sub-dimensions (e.g., customer orientation, business safety) via a criterion-splitting mechanism. It further introduces a hybrid zero-shot and few-shot prompting paradigm that enables rapid, parameter-free adaptation to new tasks, along with an iterative, engineering-driven approach to shot selection and composition. Experiments across diverse business-critical tasks demonstrate strong correlation (>80%) with human expert ratings, and in several cases the framework surpasses inter-annotator agreement. The implementation is open-sourced, substantially lowering the barrier to deploying robust, customizable LLM evaluation in vertical domains.
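The prompt paradigm described above can be illustrated with a minimal sketch: a holistic in-house criterion is split into named sub-dimensions (presented as zero-shot instructions), then combined with a few scored examples (few-shot) ahead of the candidate answer to be judged. The sub-dimension names, the scoring scale, and all text below are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of a TALEC-style judge prompt. The sub-dimension
# wording and 1-5 scale are assumptions for illustration only.

SUB_DIMENSIONS = {
    "customer orientation": "Does the answer address the customer's actual need?",
    "business safety": "Does the answer avoid claims that create business risk?",
}

def build_judge_prompt(sub_dimensions, shots, candidate):
    """Assemble: zero-shot criteria + few-shot scored examples + candidate."""
    lines = ["You are a domain judge. Score the answer from 1 to 5 on each sub-dimension."]
    # Zero-shot part: the divided criteria, stated as instructions.
    for name, question in sub_dimensions.items():
        lines.append(f"- {name}: {question}")
    # Few-shot part: in-house examples with human-assigned scores.
    for example, score in shots:
        lines.append(f"Example answer: {example}\nScore: {score}")
    # The answer under evaluation; the judge model completes the score.
    lines.append(f"Answer to evaluate: {candidate}\nScore:")
    return "\n".join(lines)

shots = [("We guarantee a full refund on any plan.", 2)]
prompt = build_judge_prompt(
    SUB_DIMENSIONS, shots,
    "Our team will review your refund request within 2 business days.",
)
print(prompt)
```

In practice the paper iterates on which shots to include and how to compose them; this sketch only shows the overall prompt layout.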

📝 Abstract
With the rapid development of large language models (LLMs), the evaluation of LLMs becomes increasingly important. Measuring text generation tasks such as summarization and article creation is very difficult. Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness, creativity, etc.) but also the specific needs of customers and business security requirements at the same time, making evaluation even more difficult. So far, the evaluation of LLMs in business scenarios has mainly relied on manual annotation, which is expensive and time-consuming. In this paper, we propose a model-based evaluation method, TALEC, which allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria. In addition, we try combining zero-shot and few-shot prompting to make the judge model focus on more information. We also propose a prompt paradigm and an engineering approach to adjust and iterate the shots, helping the judge model better understand the complex criteria. We then compare fine-tuning with ICL, finding that fine-tuning can be replaced by ICL. TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments, outperforming even the inter-human correlation in some tasks. The code is released at https://github.com/zlkqz/auto_eval
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in business domains with specific in-house criteria
Replacing expensive manual evaluation with automated model-based assessment
Teaching judge models complex domain-specific evaluation standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses criteria division for domain-specific evaluation
Combines zero-shot and few-shot in-context learning
Proposes prompt paradigm to understand complex criteria
Kaiqi Zhang
Syracuse University
Artificial Intelligence · Deep Learning
Shuai Yuan
ByteDance Inc
Honghan Zhao
ByteDance Inc