🤖 AI Summary
Existing RAG evaluation methods lack interpretability and diagnostic capability for complex queries: they fail to reveal why an answer is good or bad, or to pinpoint specific failures. To address this, we propose an evaluation framework that integrates arena-style human-preference battles with nugget-level fine-grained fact decomposition. We apply AutoNuggetizer at scale for the first time, processing roughly 7K real-world Search Arena battle records to achieve atomic, structured fact decomposition and annotation of RAG responses. Statistical correlation analysis shows that nugget scores agree with human preferences at a highly significant level (p < 0.001), confirming strong interpretability and fine-grained diagnostic power. Our method yields an attributable, verifiable, and robust automated diagnostic metric for RAG evaluation, enabling root-cause analysis of factual inaccuracies, hallucinations, and information omissions in retrieval-augmented generation.
📝 Abstract
Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach to assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluating the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a "good" response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations.
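The nugget methodology described above can be illustrated with a small sketch. In TREC-style nugget evaluation, each nugget carries an importance label (e.g., "vital" or "okay") and each answer receives a per-nugget support judgment; a response's score is then an importance-weighted fraction of supported nuggets. The label names, the partial-support weight of 0.5, and the two aggregate scores below are illustrative assumptions, not the paper's exact scoring formula:

```python
# Hedged sketch of nugget-based answer scoring.
# Labels, weights, and score definitions are assumptions for illustration.

# Assumed mapping from per-nugget support judgments to numeric credit.
SUPPORT_VALUE = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def nugget_scores(nuggets):
    """Compute two aggregate scores for one RAG response.

    nuggets: list of (importance, support) pairs, where importance is
    "vital" or "okay" and support is a key of SUPPORT_VALUE.
    Returns (all_score, vital_score): the mean credit over all nuggets,
    and over vital nuggets only.
    """
    all_vals = [SUPPORT_VALUE[s] for _, s in nuggets]
    vital_vals = [SUPPORT_VALUE[s] for imp, s in nuggets if imp == "vital"]
    all_score = sum(all_vals) / len(all_vals) if all_vals else 0.0
    vital_score = sum(vital_vals) / len(vital_vals) if vital_vals else 0.0
    return all_score, vital_score


# Hypothetical judged response: two vital nuggets and one okay nugget.
example = [
    ("vital", "support"),
    ("vital", "partial_support"),
    ("okay", "not_support"),
]
print(nugget_scores(example))  # → (0.5, 0.75)
```

Under this scheme, comparing the two sides of a battle reduces to comparing their nugget scores, which is what makes the correlation with human preference votes measurable, and the per-nugget judgments are what make a losing answer diagnosable (e.g., which vital facts it omitted).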