TurtleBench: A Visual Programming Benchmark in Turtle Geometry

📅 2024-10-31

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of large multimodal models (LMMs) in intuitive geometric reasoning and precise code generation. It introduces TurtleBench, a visual programming benchmark built on turtle geometry, a notion used to teach children foundational coding and geometric concepts. Each task presents a geometric pattern as an image, a textual instruction, or both, and requires the model to generate executable Python code that reproduces it, so visual understanding and code generation are evaluated together. The authors position it as one of the few benchmarks that evaluate this integration. Results show that leading LMMs struggle: GPT-4o achieves only 19% accuracy on the simplest tasks, and few-shot prompting improves performance only marginally (<2%).

📝 Abstract

Humans have the ability to reason about geometric patterns in images and scenes from a young age. However, developing large multimodal models (LMMs) capable of similar reasoning remains a challenge, highlighting the need for robust evaluation methods to assess these capabilities. We introduce TurtleBench, a benchmark designed to evaluate LMMs' capacity to interpret geometric patterns (given visual examples, textual instructions, or both) and generate precise code outputs. Inspired by turtle geometry, a notion used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Our evaluation reveals that leading LMMs struggle significantly with these tasks: GPT-4o achieves only 19% accuracy on the simplest tasks, and few-shot prompting only marginally improves performance ($<2\%$). TurtleBench highlights the gap between human and AI performance in intuitive and visual geometrical understanding. It stands as one of the few benchmarks to evaluate the integration of visual understanding and code generation capabilities in LMMs, setting the stage for future research in this area. Code and dataset for this paper are available at: https://github.com/sinaris76/TurtleBench
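To make the task format concrete, here is a minimal sketch of the kind of program a turtle-geometry task might call for: a patterned shape (a square repeated at evenly spaced rotations) produced by a short loop. This is an illustration, not a task from the TurtleBench dataset. `MiniTurtle`, `draw_square`, and `draw_rosette` are hypothetical names introduced here. `MiniTurtle` is a headless stand-in that only tracks position and heading, and its `forward` and `left` methods are named after the corresponding methods of Python's standard `turtle` module.

```python
import math


class MiniTurtle:
    """Hypothetical headless turtle: tracks position and heading, draws nothing."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0  # degrees; 0 means facing the +x direction
        self.path = [(0.0, 0.0)]  # every point visited, in order

    def forward(self, distance):
        # Move along the current heading and record the new point.
        rad = math.radians(self.heading)
        self.x += distance * math.cos(rad)
        self.y += distance * math.sin(rad)
        self.path.append((round(self.x, 6), round(self.y, 6)))

    def left(self, angle):
        # Turn counter-clockwise by `angle` degrees.
        self.heading = (self.heading + angle) % 360


def draw_square(t, side):
    # Four equal sides with 90-degree turns; ends where it started.
    for _ in range(4):
        t.forward(side)
        t.left(90)


def draw_rosette(t, side, copies):
    # The "underlying algorithmic logic" of the pattern:
    # repeat one shape, rotating by an equal share of 360 degrees each time.
    for _ in range(copies):
        draw_square(t, side)
        t.left(360 / copies)


t = MiniTurtle()
draw_rosette(t, side=100, copies=6)
```

Because the helpers only call `forward` and `left`, passing a `turtle.Turtle()` instance instead of `MiniTurtle()` should render the same pattern on screen, assuming a graphical display is available.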

Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs' ability to interpret geometric patterns visually and textually

Assessing AI performance in intuitive visual geometry understanding

Highlighting the gap between human and AI performance in geometric reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses turtle geometry as the basis for a visual programming benchmark

Evaluates LMMs on geometric pattern interpretation

Combines visual understanding with code generation

🔎 Similar Papers

Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming