VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of an interpretable, quantifiable framework for the subjectively perceived “vibe” of large language model (LLM) outputs, including tone, style, and formatting. We propose a formal conceptual model of vibe, grounded in iterative discovery of vibes from model outputs and a panel of LLM judges that scores each vibe, to enable user-aligned, pairwise modeling of model behavioral differences. Our framework is validated across three diverse tasks: summarization, mathematical reasoning, and captioning. Key contributions include: (1) a vibe representation that is well-defined, differentiating, and user-aligned; and (2) an end-to-end automated framework for vibe discovery and quantification. Experiments on Llama-3-70B vs. GPT-4 show that the discovered vibes predict model identity with 80% accuracy and human preference with 61% accuracy, and further runs characterize behavioral differences across models and tasks.

📝 Abstract

Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, yet traditional evaluations focus primarily on the singular axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning, to provide insight into differences in model behavior. VibeCheck discovers vibes such as the following: Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck.
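
To make the "panel of LLM judges" idea concrete, here is a minimal sketch of how per-judge scores for a single vibe might be aggregated. It is not taken from the VibeCheck repository. The scoring convention (+1 if output A shows the vibe more, -1 if output B does, 0 if neither), the function names `panel_vote` and `mean_vote`, and the toy scores are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def panel_vote(judge_scores: Sequence[int]) -> int:
    """Majority vote over judge scores in {-1, 0, 1}.

    Assumed convention (hypothetical): +1 means output A shows the vibe
    more, -1 means output B does, 0 means no clear difference.
    A tie for the most common score resolves to 0.
    """
    counts = Counter(judge_scores).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 0
    return counts[0][0]


def mean_vote(votes: Sequence[int]) -> float:
    """Average panel vote across prompts; values near +1 or -1 suggest
    the vibe consistently separates the two models."""
    return sum(votes) / len(votes)


# Made-up scores: three judges rate four prompt pairs on one vibe.
toy_scores = [[1, 1, 0], [1, -1, 0], [-1, -1, 1], [1, 1, 1]]
votes = [panel_vote(scores) for scores in toy_scores]
print(votes, mean_vote(votes))
```

Under this assumed convention, a vibe whose average vote sits close to zero does little to tell the two models apart, while one close to +1 or -1 is a candidate "differentiating" trait in the abstract's sense.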

Problem

Research questions and friction points this paper is trying to address.

Quantify qualitative differences in LLMs

Discover user-aligned model traits

Compare LLMs using automated vibe analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically compares LLM traits

Utilizes LLM judges for measurement

Validates vibes with human discovery

🔎 Similar Papers

Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores