Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses critical limitations in current evaluation methods for large language models (LLMs) in conversational visual analytics (CVA), which often require programming expertise, overlook real-world complexity, and lack interpretable metrics for multimodal text-and-visualization outputs. To bridge this gap, the authors introduce Lexara, a no-code, multi-format, multi-level evaluation toolkit grounded in user research. Lexara integrates both end-user and developer needs into its assessment framework, proposing interpretable metrics spanning data fidelity, semantic alignment, functional correctness, and design clarity for visualizations, as well as factual grounding, analytical reasoning, and conversational coherence for language. By combining rule-based checks with an LLM-as-a-Judge paradigm, Lexara enables interactive, code-free evaluation. In a two-week diary study, six CVA developers found that Lexara improved the efficiency and quality of their model and prompt selection decisions.
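
The summary's "LLM-as-a-Judge paradigm" refers to a common pattern in which a second model scores an answer along one interpretable dimension and must justify its rating. Below is a minimal sketch of that pattern, not Lexara's actual code: the `call_llm` callable, the prompt template, and the 1-5 scale are all assumptions standing in for whatever judge model and rubric the toolkit uses.

```python
# Minimal LLM-as-a-Judge sketch (hypothetical; not Lexara's implementation).
# `call_llm` is a placeholder for any chat-completion call that returns text.
import json

JUDGE_TEMPLATE = """You are evaluating a conversational visual analytics answer.
Dimension: {dimension} ({definition})
User question: {question}
Model answer: {answer}
Return JSON: {{"score": <integer 1-5>, "rationale": "<one sentence>"}}"""

def judge_dimension(call_llm, dimension, definition, question, answer):
    """Score a single language-quality dimension with a judge model,
    requiring a rationale so the metric stays interpretable."""
    prompt = JUDGE_TEMPLATE.format(
        dimension=dimension, definition=definition,
        question=question, answer=answer,
    )
    raw = call_llm(prompt)      # one judge call per dimension per test case
    result = json.loads(raw)    # fail loudly if the judge drifts off-format
    assert 1 <= result["score"] <= 5, "judge returned an out-of-range score"
    return result

# Example: scoring "factual grounding" for one test case (illustrative values).
# judge_dimension(call_llm, "factual grounding",
#                 "claims are supported by the underlying dataset",
#                 "Which region had the highest 2023 sales?",
#                 "The West region led with $4.2M in 2023.")
```

Keeping the rationale alongside the score is what makes a judge-based metric inspectable rather than a bare number, which matches the paper's emphasis on interpretability.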

📝 Abstract
Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains challenging: existing methods require programming expertise, overlook real-world complexity, and lack interpretable metrics for multi-format (visualization and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use cases, evaluation criteria, and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence), computed with rule-based and LLM-as-a-Judge methods; and (iii) an interactive interface enabling experimental setup and multi-format, multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers drawn from our initial cohort of 22; their feedback demonstrated Lexara's effectiveness in guiding appropriate model and prompt selection.
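
The abstract's rule-based side of the metric suite is deterministic: properties like data fidelity can be checked directly against the source data. The sketch below illustrates one such check under our own assumptions, using a Vega-Lite-style chart spec; the spec layout and field names are illustrative and not Lexara's schema.

```python
# Hypothetical rule-based data-fidelity check: every field a generated
# chart spec encodes must exist in the source dataset.
def check_encoded_fields(spec: dict, rows: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the check passed."""
    available = set().union(*(row.keys() for row in rows)) if rows else set()
    violations = []
    for channel, enc in spec.get("encoding", {}).items():
        field = enc.get("field")
        if field is not None and field not in available:
            violations.append(f"channel '{channel}' encodes missing field '{field}'")
    return violations

# Usage with an illustrative bar-chart spec and two data rows.
spec = {"mark": "bar",
        "encoding": {"x": {"field": "region", "type": "nominal"},
                     "y": {"field": "sales", "aggregate": "sum"}}}
rows = [{"region": "West", "sales": 4.2}, {"region": "East", "sales": 3.1}]
print(check_encoded_fields(spec, rows))  # [] -> spec references only real fields
```
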
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Conversational Visual Analytics
Model Evaluation
Interpretable Metrics
Multi-format Output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conversational Visual Analytics
Large Language Models
User-Centered Evaluation
Interpretable Metrics
LLM-as-a-Judge