Think Visually, Reason Textually: Vision-Language Synergy in ARC

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

State-of-the-art foundation models significantly underperform humans at few-shot structured rule induction (e.g., ARC-AGI tasks), in part because existing methods neglect the visual abstraction that humans rely on. Method: We propose two complementary strategies. Vision-Language Synergy Reasoning (VLSR) decomposes ARC-AGI into modality-aligned subtasks, using vision for global pattern abstraction and language for symbolic rule formulation and precise execution. Modality-Switch Self-Correction (MSSC) uses vision to verify text-based reasoning and correct its errors. Unlike conventional approaches that treat ARC-AGI as a purely textual task, and unlike naive image rendering, which degrades performance through imprecise rule execution, this design assigns each modality to the reasoning stage it suits. Contribution/Results: Experiments across multiple flagship models and ARC-AGI tasks show up to a 4.33% improvement over text-only baselines.

📝 Abstract

Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
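The division of labour described in the abstract can be sketched as a small control loop. This is a minimal illustration only, not the authors' implementation (their code has not been released): the function name `solve_with_synergy`, the three callables, and the `max_rounds` parameter are all hypothetical, and the model calls are replaced by toy stubs so the example runs on its own.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]  # (input grid, output grid) from the few-shot examples


def solve_with_synergy(
    train_pairs: List[Pair],
    test_input: Grid,
    summarize_visually: Callable[[List[Pair]], str],
    apply_rule_textually: Callable[[str, Grid], Grid],
    verify_visually: Callable[[List[Pair], Grid, Grid], Tuple[bool, str]],
    max_rounds: int = 3,  # hypothetical retry budget
) -> Grid:
    # Vision stage (assumed): abstract a global pattern from the examples.
    rule = summarize_visually(train_pairs)
    # Text stage (assumed): execute the rule precisely on the test grid.
    candidate = apply_rule_textually(rule, test_input)
    # Modality switch (assumed): check the text-derived answer visually and
    # feed any objection back into the textual step.
    for _ in range(max_rounds):
        ok, feedback = verify_visually(train_pairs, test_input, candidate)
        if ok:
            break
        rule = f"{rule}\nFeedback: {feedback}"
        candidate = apply_rule_textually(rule, test_input)
    return candidate


# Toy stubs standing in for real model calls (all hypothetical).
def _flip(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def stub_summarize(pairs: List[Pair]) -> str:
    return "mirror each row left to right"


def stub_apply(rule: str, grid: Grid) -> Grid:
    # Deliberately wrong on the first attempt so the correction loop is exercised.
    if "Feedback" in rule:
        return _flip(grid)
    return [row[:] for row in grid]


def stub_verify(pairs: List[Pair], test_input: Grid, candidate: Grid) -> Tuple[bool, str]:
    if candidate == _flip(test_input):
        return True, ""
    return False, "output is not mirrored"


examples = [([[1, 0]], [[0, 1]])]
result = solve_with_synergy(examples, [[1, 2, 3]], stub_summarize, stub_apply, stub_verify)
```

With these stubs the first textual attempt is rejected by the visual check and the second is accepted. In a real system each callable would wrap a multimodal or text-only model call, which is outside what the abstract specifies.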

Problem

Research questions and friction points this paper is trying to address.

Enhancing abstract reasoning in AI through vision-language synergy

Improving rule induction from minimal examples for ARC-AGI tasks

Overcoming limitations of pure text or vision approaches in reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Synergy Reasoning for ARC-AGI tasks

Modality-Switch Self-Correction for error verification

Decomposing reasoning into modality-aligned subtasks
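Both modalities start from the same ARC-AGI grid, which is a small matrix of integer colour codes. The sketch below shows one way the two views could be derived, assuming a plain-text serialization for the language side and an upscaled RGB pixel array for the vision side. The helper names, the `scale` parameter, and the palette values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]
RGB = Tuple[int, int, int]

# Hypothetical palette: only a few colour codes are mapped for illustration.
PALETTE: Dict[int, RGB] = {
    0: (0, 0, 0),
    1: (0, 116, 217),
    2: (255, 65, 54),
}
FALLBACK: RGB = (128, 128, 128)  # used for any unmapped colour code


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as space-separated digits, one row per line."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def grid_to_pixels(grid: Grid, scale: int = 2) -> List[List[RGB]]:
    """Expand each cell into a scale x scale block of RGB pixels."""
    pixels: List[List[RGB]] = []
    for row in grid:
        pixel_row: List[RGB] = []
        for cell in row:
            pixel_row.extend([PALETTE.get(cell, FALLBACK)] * scale)
        for _ in range(scale):
            pixels.append(list(pixel_row))
    return pixels


grid = [[0, 1], [2, 0]]
text_view = grid_to_text(grid)
image_view = grid_to_pixels(grid, scale=2)
```

The text view keeps every cell value exact, which suits precise rule execution. The pixel view exposes shapes and symmetry at a glance, which suits pattern abstraction and verification.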
