🤖 AI Summary
Vision-language models perform poorly on visual-symbolic compositional reasoning (VSCR), particularly on tasks requiring precise structural manipulation under strict constraints. Method: We introduce MathSticks, the first systematic benchmark for matchstick equation correction, which requires models to restore arithmetic validity by moving 1–2 matchsticks while respecting conservation constraints. The benchmark systematically spans digit scale, move complexity, solution-space diversity, and operator variation, and comprises 1.4 million synthetic samples plus a human-curated, high-quality test set, supporting both text-guided and vision-only evaluation paradigms. Contribution/Results: Evaluation of 14 state-of-the-art models shows that closed-source models succeed only on trivial instances, open-source models fail almost completely in the vision-only setting, while human accuracy exceeds 90%. MathSticks thus establishes a rigorous, scalable, multimodal standard for assessing compositional reasoning in vision-language systems.
📝 Abstract
We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision-language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
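To make the conservation constraint concrete, here is a minimal sketch (not the benchmark's released tooling) that scores candidate corrections using standard seven-segment stick counts for digits and common counts for operators. Note that an equal stick total between the original and corrected equation is a necessary condition for a legal move, but not sufficient on its own: a full checker would also verify that the change is realizable by physically relocating only 1–2 sticks.

```python
# Seven-segment stick counts per glyph (a common matchstick convention;
# the benchmark's exact glyph inventory is an assumption here).
STICKS = {"0": 6, "1": 2, "2": 5, "3": 5, "4": 4, "5": 5,
          "6": 6, "7": 3, "8": 7, "9": 6, "+": 2, "-": 1, "=": 2}

def stick_count(expr: str) -> int:
    """Total number of matchsticks used to render the equation string."""
    return sum(STICKS[c] for c in expr if c in STICKS)

def is_valid(expr: str) -> bool:
    """Check arithmetic validity of an equation like '0+4=4'."""
    lhs, rhs = expr.split("=")
    return eval(lhs) == int(rhs)  # toy parser; fine for digit/+/- strings

def conserving_fixes(expr: str, candidates: list[str]) -> list[str]:
    """Keep candidates that are arithmetically valid AND conserve the
    total stick count of the original (necessary for a legal move)."""
    n = stick_count(expr)
    return [c for c in candidates if stick_count(c) == n and is_valid(c)]
```

For example, `conserving_fixes("6+4=4", ["8+4=4", "6+4=10", "0+4=4"])` keeps only `"0+4=4"`: it is the sole candidate that is both correct and uses the same 18 sticks as the original (the fix corresponds to moving one stick within the `6` to form a `0`).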