Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

📅 2025-09-14

🤖 AI Summary

Current LLM-as-a-Judge evaluation methods rely on holistic scoring, which lacks fine-grained interpretability and cannot pinpoint the specific factors that influenced a judgment. To address this, the authors propose "functional fragmentation," a method that decomposes LLM outputs into key fragments and labels each one with the rhetorical function it serves (e.g., claim, evidence, concession) relative to the evaluation criteria. They instantiate the method in Evalet, an interactive visualization system that supports fragment-level comparison of functions across many model outputs. The approach shifts evaluation from opaque scalar scores toward traceable, qualitative analysis of LLM behavior. In a user study (N=10), participants using this approach identified 48% more evaluation misalignments, which helped them calibrate their trust in LLM evaluations and find more actionable issues in model outputs.

📝 Abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical function that each fragment serves relative to the evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
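To make the contrast with a holistic score concrete, the following is a minimal sketch of how one fragment-level evaluation might be represented. It is not Evalet's actual data model: the class names, field names, and sample fragments are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One span of a model output (hypothetical schema, not from the paper)."""
    text: str        # span quoted from the model output
    function: str    # short label for the role the span plays for the criterion
    fulfills: bool   # True if it helps the criterion, False if it hinders it


@dataclass
class FragmentedEvaluation:
    """All fragments found in one output for one evaluation criterion."""
    criterion: str
    fragments: list[Fragment] = field(default_factory=list)

    def summary(self) -> dict[str, int]:
        # Count fragments that fulfill versus hinder the criterion.
        positive = sum(1 for f in self.fragments if f.fulfills)
        return {"fulfills": positive, "hinders": len(self.fragments) - positive}


# Illustrative data only; the fragments and labels are invented.
evaluation = FragmentedEvaluation(
    criterion="Helpfulness",
    fragments=[
        Fragment("Restart the router first.", "gives a concrete step", True),
        Fragment("See the vendor manual, section 3.", "points to a source", True),
        Fragment("It might be many things.", "hedges without answering", False),
    ],
)

print(evaluation.summary())
```

Unlike a single score, this structure keeps each judgment attached to the text that produced it, so a reviewer can accept or dispute individual fragments.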

Problem

Research questions and friction points this paper is trying to address.

Holistic scores for LLM outputs obscure which specific elements influenced the assessment

Identifying which output fragments influence evaluation assessments is challenging

Practitioners struggle to validate and trust LLM-as-a-Judge evaluation results

Innovation

Methods, ideas, or system contributions that make the work stand out.

Functional fragmentation dissects outputs into key fragments and interprets the function each serves relative to the evaluation criteria

Evalet visualizes fragment-level functions for inspection

Shifts evaluation from quantitative scores to qualitative analysis
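The idea of comparing fragment-level functions across many outputs can be sketched as a simple grouping step. This is an assumption about one way such a view could be fed, not a description of how Evalet is implemented; the record layout, the function labels, and the `group_by_function` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (output_id, function_label, fulfills_criterion).
records = [
    ("out-1", "points to a source", True),
    ("out-2", "points to a source", True),
    ("out-2", "hedges without answering", False),
    ("out-3", "hedges without answering", False),
]


def group_by_function(rows):
    """Map each function label to the sorted list of outputs that exhibit it."""
    groups = defaultdict(set)
    for output_id, function, _fulfills in rows:
        groups[function].add(output_id)
    return {function: sorted(ids) for function, ids in groups.items()}


for function, outputs in group_by_function(records).items():
    print(f"{function}: {len(outputs)} output(s) -> {outputs}")
```

Grouping by function rather than by output makes recurring behaviors visible at a glance, which is the kind of cross-output comparison the summary attributes to the system.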
