Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-as-a-Judge evaluation methods rely on holistic scoring, which lacks fine-grained interpretability and fails to pinpoint the specific factors behind a judgment. To address this, the authors propose "functional fragmentation," a framework that decomposes LLM outputs into semantic fragments and annotates each with the rhetorical function it serves (e.g., claim, evidence, concession) relative to the evaluation criteria. Building on this, they design Evalet, an interactive visualization system that supports fragment-level functional comparison across multiple model outputs, shifting evaluation from opaque scalar scores toward traceable, attribution-aware analysis of model behavior. A user study (N=10) found that Evalet helped practitioners identify 48% more evaluation misalignments, calibrate their trust in LLM evaluations, and surface more actionable issues in model outputs.

📝 Abstract
Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
Problem

Research questions and friction points this paper addresses.

Evaluating LLM outputs with holistic scores obscures specific elements
Identifying which output fragments influence evaluation assessments is challenging
Practitioners struggle to validate and trust LLM-as-a-Judge evaluation results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Functional fragmentation dissects outputs into key fragments
Evalet visualizes fragment-level functions for inspection
Shifts evaluation from quantitative scores to qualitative analysis
Tae Soo Kim, School of Computing, KAIST
Heechan Lee, School of Computing, KAIST
Yoonjoo Lee, KAIST (Human-Computer Interaction, Natural Language Processing)
Joseph Seering, School of Computing, KAIST (Social Computing, Human-Computer Interaction, Social Agents)
Juho Kim, School of Computing, KAIST