🤖 AI Summary
Current LLM-as-a-Judge evaluation methods produce holistic scores that offer little fine-grained interpretability, so it is hard to pinpoint which elements of an output influenced a judgment. To address this, we propose functional fragmentation, a method that decomposes each LLM output into key fragments and interprets the rhetorical function each fragment serves relative to the evaluation criteria. We instantiate the method in Evalet, an interactive visualization system that supports fragment-level inspection, rating, and comparison across many outputs. The approach shifts evaluation from opaque scalar scores toward traceable, qualitative analysis of model behavior. In a user study (N=10), practitioners identified 48% more evaluation misalignments with this approach, which helped them calibrate their trust in LLM evaluations and find more actionable issues in model outputs.
📝 Abstract
Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical function that each fragment serves relative to the evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
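To make the idea of fragment-level functions concrete, the sketch below shows one possible way to represent fragments and group them across outputs. It is an illustration only, not Evalet's actual data model: the `Fragment` fields, the function labels, the `faithfulness` criterion, and the helper names are all hypothetical, and the sample fragments are written by hand rather than produced by an LLM judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One excerpt of a model output plus its interpreted function (hypothetical schema)."""
    output_id: str   # which model output the excerpt came from
    text: str        # the excerpt itself
    function: str    # what the excerpt does relative to the criterion
    criterion: str   # evaluation criterion the excerpt bears on
    fulfills: bool   # True if it helps meet the criterion, False if it hinders


def group_by_function(fragments):
    """Group fragments from many outputs by (criterion, function)."""
    groups = defaultdict(list)
    for frag in fragments:
        groups[(frag.criterion, frag.function)].append(frag)
    return dict(groups)


def outputs_with_function(groups, criterion, function):
    """Return the sorted ids of outputs that contain a given function."""
    frags = groups.get((criterion, function), [])
    return sorted({f.output_id for f in frags})


# Hand-written sample data; an LLM judge would normally produce these.
fragments = [
    Fragment("out-1", "Studies suggest a link.",
             "cites vague evidence", "faithfulness", False),
    Fragment("out-1", "According to the attached report, sales rose.",
             "cites a specific source", "faithfulness", True),
    Fragment("out-2", "Research shows this is true.",
             "cites vague evidence", "faithfulness", False),
]

groups = group_by_function(fragments)
shared = outputs_with_function(groups, "faithfulness", "cites vague evidence")
```

Grouping by function rather than by output is what allows the same behavior to be inspected across many outputs at once, which a single holistic score per output cannot show.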