MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate the capacity of multimodal large language models to perform deep compositional conditional reasoning within visual workflows; in particular, they lack systematic assessment of multi-layer condition chains grounded in visual evidence. To address this gap, this work introduces the first visual reasoning benchmark supporting deep compositional condition chains, built on an agentic synthesis pipeline centered on a Verifiable Programmatic Intermediate Representation (VPIR). The pipeline integrates a Planner, the VPIR, and a Composer to automatically generate scalable, mechanically verifiable multi-layer conditional reasoning samples across three domains: natural images, data charts, and GUI trajectories. Experimental results reveal that even the strongest current model achieves only a 53.33 path-level F1 score, with performance degrading sharply on hard negative samples and as reasoning depth or predicate complexity increases, underscoring that deep compositional visual reasoning remains a significant challenge.
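To make the headline metric concrete, here is a minimal sketch of one plausible path-level F1 computation, assuming the predicted execution path (the branch taken at each layer plus the final outcome) is matched position-by-position against the gold path. The function path_f1 and the step encoding are illustrative assumptions, not the paper's exact evaluation protocol.

from typing import List

def path_f1(pred_path: List[str], gold_path: List[str]) -> float:
    """F1 over position-tagged path steps; only an exact path scores 1.0."""
    pred = {(i, s) for i, s in enumerate(pred_path)}  # predicted (step, decision) pairs
    gold = {(i, s) for i, s in enumerate(gold_path)}  # gold (step, decision) pairs
    tp = len(pred & gold)                             # steps matching in position and value
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the model takes the wrong branch at layer 2 and diverges afterwards.
path_f1(["L1:true", "L2:false", "stop"],
        ["L1:true", "L2:true", "click_allow"])  # -> 0.33

Under this reading, a single wrong branch early in the chain poisons every later step, which would explain why the reported scores fall so quickly as reasoning depth grows.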

📝 Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures that each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains a Path F1 of only 53.33, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
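As a rough illustration of the pipeline described in the abstract, the sketch below encodes each layer's compositional condition as a boolean predicate over annotated visual facts, so the branch taken at every layer can be checked mechanically in the spirit of the VPIR. All names here (Layer, run_chain, the example facts) are hypothetical and only approximate the paper's actual representation.

from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, bool]  # ground-truth visual predicates for one image or GUI step

@dataclass
class Layer:
    condition: Callable[[Facts], bool]  # compositional condition over visual facts
    on_true: str                        # action if the condition holds
    on_false: str                       # branch (or early termination) otherwise

def run_chain(layers: List[Layer], facts: Facts) -> List[str]:
    """Follow the chain layer by layer, recording the execution path."""
    path = []
    for layer in layers:
        step = layer.on_true if layer.condition(facts) else layer.on_false
        path.append(step)
        if step == "stop":  # early termination, as in the abstract's workflows
            break
    return path

# "If a permission dialog appears AND the interface is green, click Allow."
chain = [
    Layer(lambda f: f["dialog_visible"] and f["interface_green"],
          on_true="click_allow", on_false="stop"),
    Layer(lambda f: f["toggle_on"] or f["banner_red"],
          on_true="open_settings", on_false="go_back"),
]
run_chain(chain, {"dialog_visible": True, "interface_green": True,
                  "toggle_on": False, "banner_red": True})
# -> ["click_allow", "open_settings"]

Because every layer is an executable predicate, a generated sample can be verified by replaying the chain against the ground-truth facts and checking the resulting path, which is presumably what makes the benchmark's labels mechanically checkable.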
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
visually grounded reasoning
compositional conditionals
deep reasoning chains
visual workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

deep compositional reasoning
programmatically verified benchmark
multimodal large language models
visual grounding
agent-based synthesis pipeline
👥 Authors
Haozhan Shen
Accio Team, Alibaba Group; Zhejiang University
Shilin Yan
Fudan University
MLLMs, Computer Vision, Multi-Modal
Hongwei Xue
University of Science and Technology of China
Multi-Modal, Vision-Language
Shuaiqi Lu
Accio Team, Alibaba Group
Xiaojun Tang
Accio Team, Alibaba Group
Guannan Zhang
Accio Team, Alibaba Group
Tiancheng Zhao
ZJU-BJ
Jianwei Yin
Professor of Computer Science and Technology, Zhejiang University
Service Computing, Computer Architecture, Distributed Computing, AI