🤖 AI Summary
This study investigates why vision-language models fail on abstract visual reasoning tasks such as Bongard problems, and specifically whether the bottleneck lies in visual representation or in reasoning capability. To this end, the authors propose a Componential–Grammatical (C–G) paradigm that reformulates Bongard-LOGO as a symbolic reasoning task over LOGO-style action programs or structured descriptions, giving a controlled comparison between end-to-end vision-language models on raw images and large language models (LLMs) that receive only symbolic inputs. With symbolic inputs, LLMs reach mid-90s accuracy on Free-form problems, whereas a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. The results point to representation, rather than reasoning capacity, as a key bottleneck for current models, with symbolic input serving as a diagnostic upper bound rather than a practical multimodal architecture.
📝 Abstract
Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
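To make the idea of a symbolic input concrete, here is a minimal sketch of how a Bongard-style problem might be serialized into text for an LLM. It is an illustration under stated assumptions, not the authors' pipeline: the `Action` fields, the `line`/`arc` vocabulary, the serialization format, and the prompt wording are all hypothetical and are not taken from the paper or from the Bongard-LOGO release.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    """One LOGO-style stroke. All fields are hypothetical, for illustration only."""
    kind: str      # assumed vocabulary, e.g. "line" or "arc"
    length: float  # normalized stroke length
    turn: float    # turn in degrees applied before drawing


Program = List[Action]


def program_to_text(program: Program) -> str:
    """Serialize one action program as a single line of text."""
    return "; ".join(
        f"{a.kind}(length={a.length:.2f}, turn={a.turn:.0f})" for a in program
    )


def build_prompt(positives: List[Program], negatives: List[Program],
                 query: Program) -> str:
    """Lay out a two-set Bongard problem and a query as a plain-text prompt."""
    lines = ["Set A (examples that share a hidden concept):"]
    lines += [f"  A{i + 1}: {program_to_text(p)}" for i, p in enumerate(positives)]
    lines.append("Set B (examples that do not share the concept):")
    lines += [f"  B{i + 1}: {program_to_text(p)}" for i, p in enumerate(negatives)]
    lines.append(f"Query: {program_to_text(query)}")
    lines.append("Does the query belong to Set A or Set B? Answer with A or B.")
    return "\n".join(lines)


# Toy example: closed squares (Set A) versus open two-stroke shapes (Set B).
square = [Action("line", 0.5, 90.0)] * 4
open_shape = [Action("line", 0.5, 0.0), Action("arc", 0.25, 45.0)]
prompt = build_prompt([square], [open_shape], square)
```

The resulting `prompt` string would be sent to an LLM in place of the rendered images, which is what lets the comparison isolate reasoning over structure from perception of pixels.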