Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-chart generation suffers from high code execution failure rates (~15%), semantic hallucinations, and poor color accessibility for users with color vision deficiency. Method: This paper proposes the first lightweight multi-agent framework for text-to-chart generation, decoupling the process into four stages—draft generation, execution validation, error repair, and quality assessment—using only the GPT-4o-mini model. Contribution/Results: We identify single-prompt design—not model capability—as the primary bottleneck for execution failures. Our work pioneers multi-agent collaboration for this task and extends evaluation beyond executability to include aesthetics, semantic fidelity, and accessibility. On Text2Chart31 and ChartX benchmarks, execution error rates drop to 4.5% and 4.6%, respectively. Human evaluation reveals that only 33.3% and 7.2% of generated charts are color-vision-friendly, underscoring the necessity of our expanded evaluation criteria.

📝 Abstract
Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the Text2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the ChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (Text2Chart31) and 7.2% (ChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.
Problem

Research questions and friction points this paper is trying to address.

Reducing execution errors in text-to-chart generation
Addressing hallucinations in generated chart content
Improving accessibility and aesthetics of LLM-generated charts
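The accessibility concern above can be probed with a crude heuristic: checking whether a chart's colors are drawn from a known colorblind-safe palette such as Okabe-Ito. This is an illustrative stand-in, not the paper's LLM-based audit; `chart_colors` and `is_colorblind_friendly` are hypothetical names.

```python
# Crude accessibility heuristic: flag charts whose colors fall outside the
# Okabe-Ito colorblind-safe palette. Illustrative only -- the paper itself
# uses an LLM-based audit, not a fixed-palette check.

# Okabe-Ito palette (hex), widely recommended for color-vision deficiency.
OKABE_ITO = {
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
}

def is_colorblind_friendly(chart_colors: list[str]) -> bool:
    """Return True if every color in the chart belongs to the safe palette."""
    return all(c.upper() in OKABE_ITO for c in chart_colors)
```

For example, `is_colorblind_friendly(["#E69F00", "#0072B2"])` passes, while a default red/green pairing like `["#FF0000", "#00FF00"]` fails.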
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent pipeline separates drafting, execution, repair, judgment
Uses off-the-shelf GPT-4o-mini model for efficiency
Reduces execution errors to 4.5% within three repair iterations
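The four-stage pipeline in these bullets can be sketched as a simple control loop. Everything below is an illustrative reconstruction, not the authors' code: `call_llm` stands in for a GPT-4o-mini chat call and is deliberately left unimplemented, while `try_execute` runs a candidate script in a subprocess.

```python
# Sketch of the draft -> execute -> repair -> judge loop described above.
# Assumptions: `call_llm` is a placeholder for an LLM API call; prompts are
# simplified stand-ins for whatever the paper's agents actually use.

import subprocess
import sys
import tempfile

MAX_REPAIRS = 3  # the paper reports convergence within three repair iterations

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., GPT-4o-mini)."""
    raise NotImplementedError

def try_execute(code: str) -> tuple[bool, str]:
    """Run candidate chart code in a subprocess; return (success, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=60
    )
    return result.returncode == 0, result.stderr

def generate_chart(description: str) -> str:
    # Stage 1: draft agent produces an initial script.
    code = call_llm(f"Write matplotlib code for: {description}")
    for _ in range(MAX_REPAIRS):
        # Stage 2: execution agent validates the script.
        ok, stderr = try_execute(code)
        if ok:
            break
        # Stage 3: repair agent patches the script from the traceback.
        code = call_llm(f"Fix this error:\n{stderr}\n\nCode:\n{code}")
    # Stage 4: judge agent assesses aesthetics, fidelity, accessibility.
    call_llm(f"Judge the quality of this chart code:\n{code}")
    return code
```

The key design point is the decoupling: the draft prompt never has to anticipate runtime errors, because the execution and repair agents close the loop with concrete tracebacks.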