🤖 AI Summary
Large language models (LLMs) exhibit high execution failure rates, semantic inaccuracy, and weak iterative repair capabilities when generating visualization code, primarily because existing instruction-tuning datasets lack execution feedback and multi-turn correction supervision. To address this, we introduce VisCode-200K: the first execution-driven, large-scale visualization instruction-tuning dataset, comprising over 200K code-instruction pairs with rendered images and 45K rounds of execution-feedback-guided multi-turn correction dialogues. Fine-tuning Qwen2.5-Coder-Instruct on this data, our approach integrates code execution verification, rendered-image supervision, and iterative feedback learning. On PandasPlotBench, our method significantly outperforms all open-source baselines and approaches the performance of GPT-4o-mini. Furthermore, self-debugging evaluation demonstrates robust end-to-end repair capability.
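The execution verification described above can be sketched as a simple harness that runs each candidate snippet in a fresh interpreter and records whether it fails. This is a minimal illustration, not the paper's actual pipeline; the function name `run_candidate` is an assumption, and a real visualization check would additionally confirm that an image was rendered.

```python
import subprocess
import sys


def run_candidate(code: str, timeout: int = 30) -> tuple[bool, str]:
    """Execute candidate code in a fresh interpreter; return (success, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    # A full plotting pipeline would also verify that a figure was actually
    # rendered, e.g. by forcing matplotlib's "Agg" backend and checking that
    # a saved image file exists and is non-empty.
    return proc.returncode == 0, proc.stderr
```

Isolating each run in a subprocess keeps crashes and timeouts in the candidate code from taking down the validation harness, and the captured stderr is exactly the runtime feedback a correction dialogue would be built from.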
Abstract
Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction-tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
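The self-debug protocol mentioned above can be sketched as a loop that generates code, executes it, and feeds any runtime error back into the next prompt. This is a hedged illustration under stated assumptions: `self_debug`, the prompt wording, and the `generate`/`run` callables are all hypothetical stand-ins, not the paper's exact implementation.

```python
from typing import Callable, Optional


def self_debug(generate: Callable[[str], str],
               run: Callable[[str], tuple[bool, str]],
               instruction: str,
               max_rounds: int = 3) -> Optional[str]:
    """Ask the model for code, execute it, and feed errors back until it passes."""
    prompt = instruction
    for _ in range(max_rounds):
        code = generate(prompt)
        ok, stderr = run(code)
        if ok:
            return code  # executable code found
        # Fold the runtime error into the next prompt, mirroring the
        # multi-turn correction dialogues used for training.
        prompt = (f"{instruction}\n\nYour previous code failed with:\n"
                  f"{stderr}\nPlease return a corrected version.")
    return None  # no executable code after max_rounds attempts
```

Keeping the model and executor behind plain callables makes the loop easy to test: `generate` can be any LLM client, and `run` any execution checker that returns a success flag and an error string.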