FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

📅 2025-11-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether multimodal large language models (MLLMs) can abstract symbolic mathematical rules from visual inputs, a critical bottleneck in joint visual-mathematical reasoning. The authors introduce FractalBench, a benchmark of 12 canonical fractals generated by iterated function systems (IFS); given an image, a model must reconstruct the fractal's recursive geometric structure as executable Python code. Evaluation checks both whether the generated code is syntactically valid and whether it is faithful to the underlying mathematical structure, and the benchmark is designed to be contamination-resistant. Experiments cover four MLLMs: GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL. While 76% of generated programs are syntactically valid, only 4% capture the target mathematical structure. Success rates reach 17% to 21% on Koch curves, which depend mainly on geometric transformations, but fall below 2% on tree-like fractals, exposing a systematic deficiency in modeling branching recursion.
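To make the task concrete, here is a minimal sketch of the kind of recursive program a model would need to produce for a Koch curve. It is an illustration only, not the benchmark's reference code; the function names `koch_segments` and `total_length` are hypothetical, and the output is a list of line segments rather than a rendered image.

```python
import math


def koch_segments(p0, p1, depth):
    """Return the line segments of a Koch curve drawn from p0 to p1."""
    if depth == 0:
        return [(p0, p1)]
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
    a = (x0 + dx, y0 + dy)          # one third along the segment
    b = (x0 + 2 * dx, y0 + 2 * dy)  # two thirds along the segment
    # Apex of the bump: rotate the middle third by +60 degrees about a.
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    apex = (a[0] + dx * cos60 - dy * sin60,
            a[1] + dx * sin60 + dy * cos60)
    segments = []
    # Each segment is replaced by four segments one third as long.
    for start, end in ((p0, a), (a, apex), (apex, b), (b, p1)):
        segments.extend(koch_segments(start, end, depth - 1))
    return segments


def total_length(segments):
    """Sum the Euclidean lengths of all segments."""
    return sum(math.dist(start, end) for start, end in segments)
```

A syntactically valid program can still miss this structure, for example by drawing a zigzag without recursion. A structural check would look at properties such as the segment count growing by a factor of four per recursion level.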

📝 Abstract

Mathematical reasoning requires abstracting symbolic rules from visual patterns -- inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs -- GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL -- on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% of generated programs are syntactically valid, but only 4% capture the mathematical structure. Success varies systematically -- models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench
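The claim that a few contraction maps generate a complex self-similar pattern can be illustrated with a short sketch. The example below assumes the Sierpinski triangle as the target and uses three maps, each halving distances toward one vertex of an equilateral triangle. The names `VERTICES`, `apply_ifs`, and `sierpinski_points` are hypothetical and not taken from the FractalBench repository.

```python
import math

# Vertices of an equilateral triangle with unit base. Each vertex defines one
# contraction map: p -> (p + vertex) / 2.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]


def apply_ifs(points, vertices=VERTICES):
    """Apply every contraction map to every point (one IFS iteration)."""
    return [((x + vx) / 2.0, (y + vy) / 2.0)
            for (x, y) in points
            for (vx, vy) in vertices]


def sierpinski_points(iterations, seed=(0.0, 0.0)):
    """Iterate the IFS from a single seed point; yields 3**iterations points."""
    points = [seed]
    for _ in range(iterations):
        points = apply_ifs(points)
    return points
```

Calling `sierpinski_points(6)` returns 729 points that approximate the attractor. The whole rule set is the three vertices plus the halving step, which is the compact symbolic description a model would have to infer from the image.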

Problem

Research questions and friction points this paper is trying to address.

Diagnosing AI's ability to synthesize fractal programs from visual patterns

Evaluating multimodal models on bridging visual perception with mathematical abstraction

Testing recursive reasoning through executable code generation for fractals

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes fractal programs from visual patterns

Evaluates multimodal models via executable Python code

Diagnoses visual-mathematical reasoning through recursive structures

🔎 Similar Papers

Using a CNN Model to Assess Paintings' Creativity

2024-08-02 · Citations: 0