🤖 AI Summary
While large language models excel on single-skill benchmarks (e.g., programming, mathematics, VQA), their capability on multi-skill tasks that jointly require spatial planning, basic programming, and logical reasoning remains poorly understood. Method: We introduce XLogoBench, a program synthesis benchmark for visual programming comprising 85 real-world, multi-skill tasks derived from the XLogoOnline environment. We (1) propose an emulator-feedback-driven curriculum learning fine-tuning paradigm and (2) release a synthetic dataset of over 80,000 samples to enable efficient adaptation of smaller models. Results: A fine-tuned Llama3-8B achieves a 68.2% success rate, substantially outperforming GPT-4V (20%) and Llama3-70B (2.35%). The benchmark, dataset, and code are publicly released to advance research on multi-skill compositional program synthesis.
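To make the task setting concrete, here is a minimal, hypothetical sketch of what an XLogoOnline-Mini-style task might look like: a turtle on a grid must reach a goal cell via a short command program, so solving it couples spatial planning with basic programming. The command names, task encoding, and `run` helper are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a grid-turtle task in the style of XLogoOnline Mini.
# Command names ("forward", "left", "right") and the task format are
# illustrative assumptions, not the benchmark's real interface.

DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # headings: east, south, west, north

def run(program, start, heading, goal):
    """Execute a list of commands; return True iff the turtle ends on goal."""
    r, c = start
    for cmd in program:
        if cmd == "forward":
            dr, dc = DIRS[heading]
            r, c = r + dr, c + dc
        elif cmd == "left":
            heading = (heading - 1) % 4
        elif cmd == "right":
            heading = (heading + 1) % 4
    return (r, c) == goal

# A candidate solution for one task: start at (0, 0) facing east, reach
# (1, 2). Solving it requires planning the turn before the final move.
solution = ["forward", "forward", "right", "forward"]
print(run(solution, start=(0, 0), heading=0, goal=(1, 2)))  # True
```

Because such programs are cheap to execute, a success check like `run` doubles as an automatic grader for model-generated solutions.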
📝 Abstract
Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment. The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment, each requiring a combination of skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models such as GPT-4V and Llama3-70B struggle to solve these tasks, achieving success rates of only 20% and 2.35%, respectively. Next, we develop a fine-tuning pipeline that boosts model performance by leveraging a large-scale synthetic training dataset with over 80,000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over the training data distribution. We show that a fine-tuned Llama3-8B drastically outperforms the GPT-4V and Llama3-70B models, and we provide an in-depth analysis of the models' expertise across different skill dimensions. We will publicly release the benchmark for future research on program synthesis in visual programming.
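The abstract's emulator-driven curriculum idea could be sketched as follows: the emulator measures a per-task success rate for the current model's attempts, and the next round of training samples tasks with weights favoring those near the model's current ability. The function names and the specific weighting rule below are illustrative assumptions, not the paper's exact method.

```python
import random

# Hypothetical sketch of emulator-feedback-driven curriculum design.
# `success_rates` would come from running model attempts in the task
# emulator; the weighting rule (favor tasks near a target success rate,
# i.e. "just solvable") is an assumed, illustrative choice.

def curriculum_weights(success_rates, target=0.5):
    """Weight each task by how close its emulator-measured success
    rate is to the target, so neither trivial nor hopeless tasks
    dominate the next training round."""
    return [max(1e-3, 1.0 - abs(rate - target)) for rate in success_rates]

def sample_batch(tasks, success_rates, batch_size, rng=None):
    """Draw a training batch from the curriculum-weighted distribution."""
    rng = rng or random.Random(0)
    return rng.choices(tasks, weights=curriculum_weights(success_rates),
                       k=batch_size)

tasks = ["easy_task", "medium_task", "hard_task"]
rates = [0.95, 0.50, 0.05]  # emulator-measured success per task
batch = sample_batch(tasks, rates, batch_size=10)
print(batch)
```

As the model improves, the success rates shift and the sampling distribution moves with them, which is what makes the curriculum adaptive rather than fixed.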