VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

📅 2025-06-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit high execution failure rates, semantic inaccuracy, and weak iterative repair capabilities when generating visualization code, primarily because existing instruction-tuning datasets lack execution feedback and multi-turn correction supervision. To address this, we introduce VisCode-200K: the first execution-driven, large-scale visualization instruction-tuning dataset, comprising over 200K code-instruction pairs with rendered images and 45K rounds of execution-feedback-guided multi-turn correction dialogues. Built on Qwen2.5-Coder-Instruct, our approach integrates code execution verification, rendered-image supervision, and iterative feedback learning. On PandasPlotBench, our method significantly outperforms all open-source baselines and approaches the performance of GPT-4o-mini. Furthermore, a self-debugging evaluation demonstrates robust end-to-end repair capability.


๐Ÿ“ Abstract
Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction-tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
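The first data source relies on execution validation: a plotting snippet is kept only if it runs and actually renders a figure. A minimal sketch of such a check, assuming matplotlib and an illustrative function name (this is not the VisCode-200K pipeline code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

def is_valid_plot_code(code: str) -> bool:
    """Return True if `code` executes without error and opens at least one figure."""
    plt.close("all")  # start from a clean figure registry
    try:
        exec(code, {"plt": plt})
    except Exception:
        return False  # runtime failure: discard this candidate
    # success requires that something was actually drawn
    return len(plt.get_fignums()) > 0

is_valid_plot_code("plt.figure(); plt.plot([1, 2, 3])")  # valid: renders a figure
is_valid_plot_code("plt.plot(undefined_variable)")       # invalid: raises NameError
```

Snippets that pass a check like this can then be paired with a natural-language instruction and the rendered image.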
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with correct and semantically meaningful visualization code generation
Existing datasets lack execution-grounded supervision for iterative code correction
Need for feedback-driven learning to improve executable visualization code accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs for Python visualization code
Large-scale dataset with execution-grounded supervision
Feedback-driven learning for iterative code correction
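The self-debug protocol can be sketched as a generate-execute-revise loop: run the model's code, and on failure feed the traceback back as context for a repair attempt. `model` below is a hypothetical callable standing in for VisCoder; the loop structure, not the names, is the point:

```python
import traceback

def self_debug(model, instruction: str, max_rounds: int = 3) -> str:
    """Generate code, then give the model up to `max_rounds` chances to
    repair it using the runtime error message as feedback."""
    code = model(instruction)
    for _ in range(max_rounds):
        try:
            exec(code, {})
            return code  # executed successfully, stop repairing
        except Exception:
            error = traceback.format_exc()
            # ask the model to revise the faulty code given the traceback
            code = model(
                f"{instruction}\n\nPrevious code:\n{code}\n"
                f"Error:\n{error}\nPlease fix the code."
            )
    return code  # best attempt after exhausting the repair budget
```

In the paper's evaluation, performance is reported both for the initial generation and after such feedback rounds, isolating the gain from correction supervision.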