Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming

📅 2024-06-14

🏛️ Neural Information Processing Systems

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This paper asks how well generative models perform on standardized computational thinking tests set in elementary visual programming domains. We curate a novel benchmark for this setting and find that state-of-the-art models such as GPT-4o and Llama3 barely match the performance of an average school student. To address this gap, we propose a synthetic data generation methodology based on symbolic methods. It covers several skill levels, from recognition of visual elements to multi-choice quizzes to synthesis-style tasks, and the resulting data is used to fine-tune the models. Fine-tuning on this data improves performance, and we analyze which aspects of the symbolic information in the synthetic data contribute to the gains. We plan to release the full implementation and datasets to support further research on computational thinking in generative models.

📝 Abstract

Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models' performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.
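To make the idea of symbolic synthetic data across skill levels more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the grid world, the three commands, and every name below (`step`, `run`, `shortest_program`, `make_tasks`) are hypothetical choices made for illustration. The sketch uses a symbolic executor and breadth-first search to derive, from one randomly sampled grid, a recognition question, a multi-choice question, and a synthesis task with a reference solution.

```python
import random
from collections import deque

# Hypothetical toy domain (an assumption, not the paper's): an avatar on a
# square grid, facing north/east/south/west, controlled by three commands.
DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # (dx, dy) for N, E, S, W
COMMANDS = ["move", "turn_left", "turn_right"]


def step(state, command, size):
    """Apply one command; return the new state, or None if the avatar leaves the grid."""
    x, y, d = state
    if command == "turn_left":
        return (x, y, (d - 1) % 4)
    if command == "turn_right":
        return (x, y, (d + 1) % 4)
    nx, ny = x + DIRS[d][0], y + DIRS[d][1]
    return (nx, ny, d) if 0 <= nx < size and 0 <= ny < size else None


def run(state, program, size):
    """Execute a whole program symbolically; None signals a crash."""
    for command in program:
        state = step(state, command, size)
        if state is None:
            return None
    return state


def shortest_program(start, goal, size):
    """Breadth-first search over avatar states yields a shortest reference solution."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, program = queue.popleft()
        if state[:2] == goal:
            return program
        for command in COMMANDS:
            nxt = step(state, command, size)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, program + [command]))
    return None


def make_tasks(seed, size=4):
    """Derive one task per skill level from a single sampled grid."""
    rng = random.Random(seed)
    start = (rng.randrange(size), rng.randrange(size), rng.randrange(4))
    goal = start[:2]
    while goal == start[:2]:
        goal = (rng.randrange(size), rng.randrange(size))
    solution = shortest_program(start, goal, size)

    # Distractors: single-command mutations that no longer end on the goal.
    distractors = []
    while len(distractors) < 2:
        candidate = list(solution)
        candidate[rng.randrange(len(candidate))] = rng.choice(COMMANDS)
        end = run(start, candidate, size)
        if (end is None or end[:2] != goal) and candidate not in distractors:
            distractors.append(candidate)
    options = distractors + [solution]
    rng.shuffle(options)

    return {
        "grid": {"size": size, "start": start, "goal": goal},
        "recognition": {"question": "Where is the avatar?", "answer": start[:2]},
        "multiple_choice": {"options": options, "answer": options.index(solution)},
        "synthesis": {"question": "Write a program that reaches the goal.",
                      "reference": solution},
    }


if __name__ == "__main__":
    for name, task in make_tasks(seed=0).items():
        print(name, task)
```

Because every answer comes from the executor rather than from a model, the labels are correct by construction. That property is what makes symbolic generation attractive as a source of fine-tuning data in a sketch like this one.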

Problem

Research questions and friction points this paper is trying to address.

Assessing generative models on elementary computational thinking tests.

Improving model performance using synthetic data generation.

Enhancing computational thinking skills in generative AI.

Innovation

Methods, ideas, or system contributions that make the work stand out.

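Fine-tuning models on synthetically generated data.

Building a comprehensive dataset with symbolic methods that spans skill levels, from recognition of visual elements to multi-choice quizzes to synthesis-style tasks.

Showing how different aspects of symbolic information in the synthetic data improve fine-tuned models' performance.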

🔎 Similar Papers

FADE-CTP: A Framework for the Analysis and Design of Educational Computational Thinking Problems