Explain Before You Answer: A Survey on Compositional Visual Reasoning

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
The field of compositional visual reasoning (CVR) lacks a systematic, up-to-date survey. Method: We conduct a large-scale bibliometric analysis of 260+ top-conference papers published between 2023 and 2025, constructing the first unified taxonomy spanning the full technical stack and a five-stage paradigm evolution framework, from prompt-enhanced language-centric pipelines to unified agentic vision-language models (VLMs). Contribution/Results: We identify three core challenges: interpretable reasoning, high-resolution perceptual fidelity, and chain-of-thought faithfulness. We consolidate 60+ multimodal benchmarks and cognitive evaluation protocols, exposing fundamental limitations including LLM hallucination and a bias toward deductive reasoning. Finally, we propose key future directions (world-model integration and human-AI collaborative reasoning) to advance cognitive alignment and semantic interpretability in multimodal AI. This work provides a foundational reference for next-generation CVR research.

📝 Abstract
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently emerged chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
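The "decompose, ground, infer" process the abstract describes can be made concrete with a toy sketch. This is not code from the survey or from any surveyed system; the scene representation, `find` grounding step, and `left_of` relation are all hypothetical stand-ins for the perception and reasoning modules a real pipeline would use:

```python
from dataclasses import dataclass

# Toy scene representation standing in for visual perception output.
# All names and attributes here are illustrative assumptions.
@dataclass
class SceneObject:
    name: str
    color: str
    position: tuple  # (x, y) image coordinates

scene = [
    SceneObject("cup", "red", (2, 3)),
    SceneObject("book", "blue", (5, 3)),
    SceneObject("cup", "blue", (7, 1)),
]

def find(objects, name=None, color=None):
    """Grounding step: select objects matching attribute constraints."""
    return [o for o in objects
            if (name is None or o.name == name)
            and (color is None or o.color == color)]

def left_of(a, b):
    """Relational step: spatial comparison between grounded objects."""
    return a.position[0] < b.position[0]

# Compositional program for: "Is the red cup left of the book?"
# The question is decomposed into grounding and relation steps,
# then answered by multi-step inference over the grounded objects.
red_cups = find(scene, name="cup", color="red")
books = find(scene, name="book")
answer = all(left_of(c, b) for c in red_cups for b in books)
print(answer)  # -> True
```

Each intermediate step (the grounded object sets, the relation checks) is inspectable, which is the interpretability advantage compositional approaches claim over monolithic end-to-end models.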
Problem

Research questions and friction points this paper is trying to address.

Synthesizing rapidly expanding literature on compositional visual reasoning in AI
Systematically reviewing 260+ papers to formalize definitions and paradigm shifts
Identifying open challenges and future directions for visual reasoning research
Innovation

Methods, ideas, or system contributions that make the work stand out.

First dedicated survey of 260+ compositional visual reasoning papers (2023-2025)
Tracing a five-stage paradigm shift, from prompt-enhanced pipelines to agentic VLMs
Cataloging 60+ benchmarks and their corresponding evaluation metrics