From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard large language model (LLM) benchmarks often fail to capture real-world usefulness, so practitioners fall back on informal “vibe checks” that are hard to analyze or reproduce. This work studies how vibe-testing is done in practice, drawing on a survey of user evaluation practices and a collection of in-the-wild model comparison reports from blogs and social media, and formalizes it as a two-part process: users personalize both what they test and how they judge responses. The authors instantiate this formulation in a proof-of-concept pipeline that generates personalized prompts and compares model outputs using user-aware subjective criteria. In experiments on coding benchmarks, combining personalized prompts with user-aware evaluation can change which model is preferred, which suggests that formalized vibe-testing can help bridge benchmark scores and real-world experience.

📝 Abstract

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on “vibe-testing”: informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
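
To make the two-part formulation concrete, here is a minimal Python sketch. It is not the authors' pipeline: the `UserProfile` fields, the prompt template, the weighted-sum scoring rule, and all numbers are hypothetical assumptions chosen only to illustrate how personalizing what is tested and how responses are judged can flip which model is preferred.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UserProfile:
    """Hypothetical user description; field names are illustrative only."""
    role: str
    context: str
    # Relative importance of each subjective criterion for this user.
    criteria_weights: Dict[str, float] = field(default_factory=dict)


def personalize_prompt(base_task: str, user: UserProfile) -> str:
    """Part 1: adapt *what* is tested to the user's own workflow."""
    return f"As a {user.role} working on {user.context}: {base_task}"


def user_aware_score(criterion_scores: Dict[str, float], user: UserProfile) -> float:
    """Part 2: adapt *how* a response is judged (assumed weighted sum)."""
    return sum(
        weight * criterion_scores.get(name, 0.0)
        for name, weight in user.criteria_weights.items()
    )


def preferred_model(scores_by_model: Dict[str, Dict[str, float]], user: UserProfile) -> str:
    """Return the model whose response scores highest for this user."""
    return max(scores_by_model, key=lambda m: user_aware_score(scores_by_model[m], user))


# Made-up per-criterion scores for two models' responses to the same task.
toy_scores = {
    "model_a": {"correctness": 0.9, "readability": 0.5},
    "model_b": {"correctness": 0.8, "readability": 0.9},
}

backend_dev = UserProfile(
    role="backend engineer",
    context="a payments service",
    criteria_weights={"correctness": 0.9, "readability": 0.1},
)
student = UserProfile(
    role="student",
    context="an intro programming course",
    criteria_weights={"correctness": 0.4, "readability": 0.6},
)

if __name__ == "__main__":
    print(personalize_prompt("Write a function that parses a CSV file.", student))
    print(preferred_model(toy_scores, backend_dev))
    print(preferred_model(toy_scores, student))
```

With these toy numbers the two users prefer different models even though the underlying responses are identical, which mirrors the preference change the abstract reports. In a real setting the per-criterion scores would come from a human or model judge rather than a hard-coded table.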

Problem

Research questions and friction points this paper is trying to address.

LLM evaluation

vibe-testing

user experience

benchmarking

subjective assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

vibe-testing

personalized evaluation

user-aware criteria

LLM evaluation

subjective benchmarking

🔎 Similar Papers

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

2024-06-12

Citations: 0