Does UMBRELA Work on Other LLMs?

📅 2025-07-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates the generalizability of the UMBRELA LLM Judge framework across diverse large language models (LLMs), specifically examining how the choice of judge model affects the accuracy of relevance assessment. Method: the authors conduct the first cross-model empirical evaluation beyond GPT-4o, spanning mainstream open and closed models including DeepSeek-V3 and LLaMA-3.3 (70B/8B), using two quantitative metrics: leaderboard ranking correlation and per-label agreement. Contribution/Results: UMBRELA with DeepSeek-V3 achieves performance comparable to that with GPT-4o; accuracy declines moderately on LLaMA-3.3-70B and drops significantly as model scale decreases further. These findings show that the capability limits of the judge model critically affect evaluation robustness. The work provides empirical evidence and methodological guidance on model adaptability and trustworthiness in LLM evaluation frameworks, advancing the principled design of scalable, reliable automated assessment systems.

๐Ÿ“ Abstract
We reproduce the UMBRELA LLM Judge evaluation framework across a range of large language models (LLMs) to assess its generalizability beyond the original study. Our investigation evaluates how LLM choice affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek V3 achieves performance very comparable to GPT-4o (used in the original work). With LLaMA-3.3-70B we obtain slightly lower performance, which degrades further with smaller LLMs.
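The two metrics named in the abstract can be illustrated with a small sketch. This is not the paper's code: the exact correlation and agreement statistics used are not specified here, so the sketch assumes the common choices for this kind of evaluation (Kendall's tau for leaderboard rank correlation, Cohen's kappa for chance-corrected per-label agreement) and uses toy data, not the paper's runs.

```python
from itertools import combinations

def kendall_tau(a, b):
    # Leaderboard rank correlation: fraction of concordant minus
    # discordant system pairs between two rankings of the same systems.
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def cohen_kappa(x, y, labels):
    # Per-label agreement between two assessors, corrected for the
    # agreement expected by chance from each assessor's label distribution.
    n = len(x)
    p_obs = sum(a == b for a, b in zip(x, y)) / n
    p_exp = sum((x.count(l) / n) * (y.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy data: official ranks vs. LLM-judge ranks for five systems, and
# graded relevance labels (0-3, as in UMBRELA) from human vs. model.
official = [1, 2, 3, 4, 5]
llm_judge = [1, 3, 2, 4, 5]
human = [0, 1, 3, 2, 0, 3]
model = [0, 1, 3, 3, 0, 2]

print(kendall_tau(official, llm_judge))                # → 0.8
print(round(cohen_kappa(human, model, [0, 1, 2, 3]), 3))  # → 0.538
```

High tau means the LLM judge would reproduce the official leaderboard ordering even if its individual labels differ; kappa captures how often its per-document labels match a human's beyond chance. The paper's finding is that both degrade as the judge model shrinks.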
Problem

Research questions and friction points this paper is trying to address.

Assess UMBRELA framework generalizability across different LLMs
Evaluate LLM choice impact on relevance assessment accuracy
Compare performance of UMBRELA with various LLM models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reproducing UMBRELA framework across multiple LLMs
Evaluating LLM choice impact on relevance accuracy
Comparing DeepSeek V3 and GPT-4o performance