🤖 AI Summary
This study investigates whether multimodal large language models (MLLMs) can abstract mathematical symbolic rules from visual inputs, a critical bottleneck in joint visual-mathematical reasoning. To this end, we introduce FractalBench, a novel benchmark comprising 12 classes of classical fractal images generated via iterated function systems (IFS), requiring models to reconstruct their recursive geometric structures as executable code. We propose a robust evaluation framework resistant to visual perturbations, assessing both the syntactic correctness of the generated code and its fidelity to the underlying mathematical structure. Experiments span state-of-the-art MLLMs, including GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL. Results reveal that while 76% of generated programs are syntactically correct, only 4% accurately preserve the target mathematical structure. Success rates on Koch curve reconstruction, which depends mainly on geometric transformations, range from 17% to 21%, whereas success on tree-like structures falls below 2%, exposing a systematic deficiency in modeling branching recursion.
📝 Abstract
Mathematical reasoning requires abstracting symbolic rules from visual patterns, that is, inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL) on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% of generated programs are syntactically valid, but only 4% capture the mathematical structure. Success varies systematically: models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench
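To make concrete how a few contraction maps produce a self-similar pattern, the following is a minimal sketch of an Iterated Function System for the Sierpinski triangle. It is an illustration only and is not taken from the FractalBench code base; the names `VERTICES`, `make_map`, and `ifs_points`, and the choice of fractal, are assumptions made for this example.

```python
# Illustrative IFS sketch (hypothetical names, not FractalBench code).
# Each map contracts the plane by a factor of 1/2 toward one triangle vertex.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.75 ** 0.5)]


def make_map(vertex):
    """Return the contraction p -> (p + vertex) / 2."""
    vx, vy = vertex
    return lambda p: ((p[0] + vx) / 2.0, (p[1] + vy) / 2.0)


MAPS = [make_map(v) for v in VERTICES]


def ifs_points(depth, seed=((0.0, 0.0),)):
    """Apply every map to every point, `depth` times.

    Starting from the seed points, each iteration triples the point
    count, so the result has len(seed) * 3**depth points.
    """
    points = list(seed)
    for _ in range(depth):
        points = [f(p) for f in MAPS for p in points]
    return points


if __name__ == "__main__":
    pts = ifs_points(6)
    print(len(pts), "points approximating the Sierpinski triangle")
```

The three maps are the entire "program" behind the image: a model that recovers them from a rendering has captured the mathematical structure, whereas code that merely draws a similar-looking picture has not.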