Visualization Generation with Large Language Models: An Evaluation

📅 2024-01-20
🏛️ arXiv.org
📈 Citations: 44
Influential: 2
🤖 AI Summary
Large language models (LLMs) lack systematic evaluation in natural language-to-visualization (NL2VIS) tasks, particularly for generating Vega-Lite specifications. Method: This work introduces the first benchmarking framework dedicated to Vega-Lite generation, built upon the nvBench dataset; it conducts zero-shot and few-shot prompting experiments with GPT-3.5 to rigorously assess NL2VIS capability. Results: GPT-3.5 substantially outperforms traditional methods; few-shot prompting consistently surpasses zero-shot; key failure modes include semantic misinterpretation of data (e.g., incorrect attribute typing) and syntactic/structural errors in Vega-Lite code. The study identifies critical limitations of current LLMs in visualization specification generation and proposes concrete directions for benchmark refinement. It establishes a reproducible, empirically grounded evaluation paradigm for NL2VIS research.
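To make the task concrete, below is a minimal sketch of an NL2VIS input-output pair: a natural language query and a Vega-Lite specification of the kind the benchmark scores. The query, dataset path, and field names are invented for illustration and are not taken from nvBench.

```python
# Hypothetical NL2VIS example (not from nvBench): a natural language
# query paired with the Vega-Lite specification a model should emit.
import json

query = "Show the average price for each product category as a bar chart."

# A Vega-Lite spec answering the query: bar marks, the nominal field on
# x, and the aggregated quantitative field on y. Getting these attribute
# types right is exactly the failure mode the summary mentions.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": "data/products.csv"},  # hypothetical dataset path
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "price", "type": "quantitative", "aggregate": "mean"},
    },
}

print(json.dumps(spec, indent=2))
```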

📝 Abstract
Analysts frequently need to create visualizations during data analysis to obtain and communicate insights. To reduce the burden of creating visualizations, previous research has developed various approaches for analysts to create visualizations from natural language queries. Recent studies have demonstrated the capabilities of large language models in natural language understanding and code generation tasks. These capabilities suggest the potential of using large language models to generate visualization specifications from natural language queries. In this paper, we evaluate the capability of a large language model to generate visualization specifications on the task of natural language to visualization (NL2VIS). More specifically, we have opted for GPT-3.5 and Vega-Lite to represent large language models and visualization specifications, respectively. The evaluation is conducted on the nvBench dataset. In the evaluation, we utilize both zero-shot and few-shot prompt strategies. The results demonstrate that GPT-3.5 surpasses previous NL2VIS approaches. Additionally, the performance of few-shot prompts is higher than that of zero-shot prompts. We discuss the limitations of GPT-3.5 on NL2VIS, such as misunderstanding data attributes and grammar errors in generated specifications. We also summarize several directions for improving the NL2VIS benchmark, such as correcting the ground truth and reducing ambiguities in natural language queries.
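As a rough sketch of the two prompt strategies the abstract compares, the snippet below assembles zero-shot and few-shot chat prompts for an OpenAI model. The prompt wording, the schema format, and the model name are our assumptions for illustration; the paper's actual prompts may differ.

```python
# Minimal sketch of zero-shot vs. few-shot prompting for NL2VIS.
# Prompt wording and the example pairs are hypothetical, not the paper's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = "You translate natural language queries into Vega-Lite JSON specifications."

def zero_shot(query: str, schema: str) -> list[dict]:
    # Zero-shot: only the task description, the table schema, and the query.
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": f"Table schema: {schema}\nQuery: {query}\nVega-Lite spec:"},
    ]

def few_shot(query: str, schema: str, examples: list[tuple[str, str]]) -> list[dict]:
    # Few-shot: prepend (query, spec) demonstrations before the real query.
    messages = [{"role": "system", "content": SYSTEM}]
    for ex_query, ex_spec in examples:
        messages.append({"role": "user", "content": f"Query: {ex_query}"})
        messages.append({"role": "assistant", "content": ex_spec})
    messages.append({"role": "user",
                     "content": f"Table schema: {schema}\nQuery: {query}\nVega-Lite spec:"})
    return messages

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stands in for the GPT-3.5 model used in the paper
    messages=zero_shot("Plot average price per category as a bar chart.",
                       "products(category TEXT, price REAL)"),
)
print(response.choices[0].message.content)
```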
Problem

Research questions and friction points this paper is trying to address.

Evaluates an LLM's (GPT-3.5) capability to generate visualizations from natural language queries.
Assesses zero-shot and few-shot prompt strategies across chart types using Vega-Lite specifications.
Identifies performance disparities and counterintuitive behaviors to inform improvements to the NL2VIS benchmark.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates GPT-3.5 with zero-shot and few-shot prompt strategies
Uses the nvBench dataset and Vega-Lite specifications
Measures accuracy, validity, and legality of generated charts (a minimal validity check is sketched below)
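The validity measurement above can be illustrated with a small check: does the model output parse as JSON, and does it conform to the published Vega-Lite v5 schema? This is a minimal sketch under our own assumptions, not the paper's actual metric implementation.

```python
# Minimal validity check for generated specifications (a sketch, not the
# paper's metric code): syntactic validity (parses as JSON) plus
# structural validity against the published Vega-Lite v5 schema.
import json
import urllib.request

import jsonschema  # pip install jsonschema

VEGA_LITE_SCHEMA_URL = "https://vega.github.io/schema/vega-lite/v5.json"

def is_valid_spec(output: str, schema: dict) -> bool:
    """Return True if `output` is well-formed Vega-Lite JSON."""
    try:
        spec = json.loads(output)          # syntactic validity
        jsonschema.validate(spec, schema)  # structural validity
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

# Fetch the Vega-Lite schema once and reuse it for every candidate.
with urllib.request.urlopen(VEGA_LITE_SCHEMA_URL) as f:
    vl_schema = json.load(f)

candidate = json.dumps({
    "data": {"values": [{"category": "A", "price": 10}]},
    "mark": "bar",
    "encoding": {"x": {"field": "category", "type": "nominal"},
                 "y": {"field": "price", "type": "quantitative"}},
})
print(is_valid_spec(candidate, vl_schema))  # True
```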
👥 Authors
Guozheng Li, Beijing Institute of Technology
Xinyu Wang, Beijing Institute of Technology
Gerile Aodeng
Shunyuan Zheng, Harbin Institute of Technology (Computer Vision, 3D Vision, Digital Human)
Yu Zhang, University of Oxford
Chuangxin Ou
Song Wang
Chi Harold Liu, Professor, Vice Dean, Fellow of IET and BCS, Beijing Institute of Technology (IoT, Mobile Crowd Sensing, UAV Crowdsensing, Embodied AI, Deep Reinforcement Learning)