🤖 AI Summary
This study investigates the generalizability of the UMBRELA LLM Judge framework across diverse large language models (LLMs), specifically examining how the choice of judge model affects the accuracy of relevance assessment.
Method: We conduct the first cross-model empirical evaluation beyond GPT-4o, spanning mainstream open and closed models including DeepSeek-V3 and LLaMA-3.3 (70B/8B), using two quantitative metrics: leaderboard ranking correlation and per-label agreement.
Contribution/Results: UMBRELA achieves performance on DeepSeek-V3 comparable to that on GPT-4o; however, accuracy declines moderately on LLaMA-3.3-70B and drops significantly as model scale decreases. These findings reveal the critical influence of judge model capability on evaluation robustness. The work provides empirical evidence and methodological guidance for model adaptability and trustworthiness in LLM evaluation frameworks, advancing principled design of scalable, reliable automated assessment systems.
📄 Abstract
We reproduce the UMBRELA LLM Judge evaluation framework across a range of large language models (LLMs) to assess its generalizability beyond the original study. Our investigation evaluates how the choice of LLM affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek-V3 achieves performance comparable to GPT-4o (the model used in the original work). With LLaMA-3.3-70B we obtain slightly lower performance, which degrades further with smaller LLMs.
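The two evaluation metrics named above can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which compares system leaderboards and per-document labels at TREC scale); it is a simplified stand-in assuming integer relevance labels, with illustrative function names and hypothetical data:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a, ignoring ties) between two score
    lists, e.g. system leaderboard scores under human vs. LLM judgments.
    +1 means identical ranking, -1 means fully reversed."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

def per_label_agreement(human, judge):
    """Fraction of items where the LLM judge assigns exactly the same
    relevance label as the human assessor."""
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

# Hypothetical example: 0-3 relevance labels for six documents.
human_labels = [0, 1, 2, 3, 2, 0]
judge_labels = [0, 1, 3, 3, 2, 1]
print(per_label_agreement(human_labels, judge_labels))  # 4 of 6 match
```

A higher-capability judge model would be expected to push both numbers upward: closer rank correlation on the leaderboard and more exact label matches per document.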