🤖 AI Summary
This study addresses the lack of capability assessment for large language models (LLMs) on computational thinking tests set in elementary-school visual programming domains. We curate a novel benchmark for this setting and find that state-of-the-art models such as GPT-4o and Llama3 barely match the performance of an average school student. To improve on this, we propose a synthetic data generation methodology based on symbolic methods that covers multiple skill levels, from recognition of visual elements to multiple-choice quizzes to synthesis-style tasks, and we use the resulting dataset to fine-tune the models. Experiments show that various aspects of the symbolic information in the synthetic data help improve the fine-tuned models' performance. To support further research on computational thinking in generative models, we will release the full implementation and datasets.
📝 Abstract
Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models' performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.