🤖 AI Summary
Large language models (LLMs) exhibit high execution failure rates, semantic inaccuracy, and weak iterative repair capabilities when generating visualization code, primarily because existing instruction-tuning datasets lack execution feedback and multi-turn correction supervision. To address this, we introduce VisCode-200K: the first execution-driven, large-scale visualization instruction-tuning dataset, comprising over 200K code-instruction pairs with rendered images and 45K rounds of execution-feedback-guided multi-turn correction dialogues. Fine-tuning Qwen2.5-Coder-Instruct on this data, our approach integrates code execution verification, rendered-image supervision, and iterative feedback learning. On PandasPlotBench, our method significantly outperforms all open-source baselines and approaches the performance of GPT-4o-mini. Furthermore, self-debugging evaluation demonstrates robust end-to-end repair capability.
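The execution verification described above can be sketched as a simple harness that runs each candidate snippet in a fresh interpreter and records whether it fails. This is a minimal illustration, not the paper's actual pipeline; the function name `run_candidate` is an assumption, and a real visualization check would additionally confirm that an image was rendered.

```python
import subprocess
import sys


def run_candidate(code: str, timeout: int = 30) -> tuple[bool, str]:
    """Execute candidate code in a fresh interpreter; return (success, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    # A full plotting pipeline would also verify that a figure was actually
    # rendered, e.g. by forcing matplotlib's "Agg" backend and checking that
    # a saved image file exists and is non-empty.
    return proc.returncode == 0, proc.stderr
```

Isolating each run in a subprocess keeps crashes and timeouts in the candidate code from taking down the validation harness, and the captured stderr is exactly the runtime feedback a correction dialogue would be built from.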
Abstract
Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction-tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, enabling models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder, and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
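The self-debug protocol mentioned above can be sketched as a loop that generates code, executes it, and feeds any runtime error back into the next prompt. This is a hedged illustration under stated assumptions: `self_debug`, the prompt wording, and the `generate`/`run` callables are all hypothetical stand-ins, not the paper's exact implementation.

```python
from typing import Callable, Optional


def self_debug(generate: Callable[[str], str],
               run: Callable[[str], tuple[bool, str]],
               instruction: str,
               max_rounds: int = 3) -> Optional[str]:
    """Ask the model for code, execute it, and feed errors back until it passes."""
    prompt = instruction
    for _ in range(max_rounds):
        code = generate(prompt)
        ok, stderr = run(code)
        if ok:
            return code  # executable code found
        # Fold the runtime error into the next prompt, mirroring the
        # multi-turn correction dialogues used for training.
        prompt = (f"{instruction}\n\nYour previous code failed with:\n"
                  f"{stderr}\nPlease return a corrected version.")
    return None  # no executable code after max_rounds attempts
```

Keeping the model and executor behind plain callables makes the loop easy to test: `generate` can be any LLM client, and `run` any execution checker that returns a success flag and an error string.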