MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

📅 2025-10-01
🤖 AI Summary
Vision-language models perform poorly on visual-symbolic compositional reasoning (VSCR), particularly on tasks that require precise structural manipulation under strict constraints. Method: We introduce MathSticks, a benchmark for matchstick equation correction, in which a model must restore arithmetic validity by moving one or two matchsticks while respecting conservation constraints. The benchmark systematically covers digit scale, move complexity, solution multiplicity, and operator variation. It comprises 1.4 million generated instances and a curated test set, and supports both text-guided and vision-only evaluation. Contribution/Results: An evaluation of 14 vision-language models finds that closed-source models succeed only on simple cases, open-source models fail in the vision-only setting, and humans exceed 90% accuracy. MathSticks is positioned as a rigorous, scalable testbed for assessing compositional reasoning in vision-language systems.

📝 Abstract
We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision-language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
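
To make the task concrete, here is a minimal sketch of a brute-force one-move solver. It rests on assumptions that are not taken from the paper: digits are drawn as standard seven-segment glyphs (segments `a` to `g`), operators and the equals sign stay fixed, and the left-hand side holds exactly one `+` or `-`. The names `SEGMENTS`, `is_valid`, and `one_move_solutions` are hypothetical, and the actual MathSticks stick layout and generator may differ.

```python
from typing import List

# Assumed seven-segment encoding (illustrative; not necessarily the MathSticks layout).
SEGMENTS = {
    "0": "abcdef", "1": "bc", "2": "abdeg", "3": "abcdg", "4": "bcfg",
    "5": "acdfg", "6": "acdefg", "7": "abc", "8": "abcdefg", "9": "abcdfg",
}
DIGIT_OF = {frozenset(segs): digit for digit, segs in SEGMENTS.items()}


def is_valid(eq: str) -> bool:
    """Check arithmetic validity of 'a+b=c' or 'a-b=c' without using eval."""
    lhs, rhs = eq.split("=")
    for op in "+-":
        if op in lhs:
            a, b = lhs.split(op)
            value = int(a) + int(b) if op == "+" else int(a) - int(b)
            return value == int(rhs)
    return int(lhs) == int(rhs)


def one_move_solutions(eq: str) -> List[str]:
    """Return every valid equation reachable by moving exactly one stick."""
    chars = list(eq)
    positions = [i for i, c in enumerate(chars) if c.isdigit()]
    state = {i: set(SEGMENTS[chars[i]]) for i in positions}
    found = set()
    for src in positions:
        for seg in sorted(state[src]):
            for dst in positions:
                for new_seg in "abcdefg":
                    if dst == src and new_seg == seg:
                        continue  # putting the stick back is not a move
                    trial = {i: set(s) for i, s in state.items()}
                    trial[src].remove(seg)
                    if new_seg in trial[dst]:
                        continue  # target slot is already occupied
                    trial[dst].add(new_seg)
                    # Every glyph must still be a legal digit after the move.
                    if not all(frozenset(s) in DIGIT_OF for s in trial.values()):
                        continue
                    out = chars[:]
                    for i, s in trial.items():
                        out[i] = DIGIT_OF[frozenset(s)]
                    candidate = "".join(out)
                    if is_valid(candidate):
                        found.add(candidate)
    return sorted(found)


print(one_move_solutions("6+4=4"))
```

Under these assumptions, the middle stick of the `6` can be moved to its upper-right slot to form a `0`, which gives `0+4=4`. Because every move relocates an existing stick, the total stick count is conserved by construction. Extending the search to two moves or to stick-drawn operators would follow the same pattern with a larger search space.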
Problem

Research questions and friction points this paper is trying to address.

Correcting incorrect matchstick equations by moving sticks under conservation rules
Unifying visual perception, symbolic manipulation, and arithmetic consistency reasoning
Systematically testing puzzles across digit scale, move complexity, solution multiplicity, and operator variation
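
The conservation constraint in the points above can be illustrated with a small answer checker. This is a sketch under stated assumptions, not the benchmark's scoring code: digits use an assumed seven-segment encoding, operators are treated as fixed, and `moves_needed` is a hypothetical helper name.

```python
from typing import Optional

# Assumed seven-segment encoding (illustrative; not taken from the paper).
SEGMENTS = {
    "0": "abcdef", "1": "bc", "2": "abdeg", "3": "abcdg", "4": "bcfg",
    "5": "acdfg", "6": "acdefg", "7": "abc", "8": "abcdefg", "9": "abcdfg",
}


def moves_needed(source: str, target: str) -> Optional[int]:
    """Return how many stick moves turn source into target.

    Returns None when the stick count is not conserved, when the two
    strings differ in length, or when a non-digit symbol changes.
    """
    if len(source) != len(target):
        return None
    removed = added = 0
    for s, t in zip(source, target):
        if s.isdigit() and t.isdigit():
            before, after = set(SEGMENTS[s]), set(SEGMENTS[t])
            removed += len(before - after)  # sticks that must be picked up
            added += len(after - before)    # empty slots that must be filled
        elif s != t:
            return None  # operators are fixed in this sketch
    # Each move pairs one picked-up stick with one filled slot.
    return removed if removed == added else None


print(moves_needed("6+4=4", "0+4=4"))
print(moves_needed("6+4=4", "8+4=4"))
```

A proposed correction would then be accepted only if `moves_needed` returns 1 or 2 and the target equation is arithmetically true. Changing `6` to `8` adds a stick without removing one, so the second call reports a conservation violation.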
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual symbolic reasoning benchmark with matchstick puzzles
Generates 1.4M instances covering text-guided and purely visual settings
Tests 14 vision-language models against human performance