Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) struggle to infer the next step of a procedure from images of its intermediate states, a task known as visual procedure question answering (VP-QA). To address this, the authors propose Chain-of-Procedure (CoP), a hierarchical reasoning framework, and introduce ProcedureVQA, a new multimodal benchmark for VP-QA. CoP uses cross-modal retrieval to match the visual state to the relevant instructions, refines the retrieved steps through semantic decomposition, and then predicts the next action. A systematic evaluation of mainstream VLMs on VP-QA identifies two key bottlenecks: cross-modal retrieval and fine-grained alignment between images and steps. In experiments across six models, CoP achieves up to a 13% absolute improvement in accuracy over baseline methods.

📝 Abstract

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. In VP-QA, users upload images of the intermediate states of a complex procedure and ask which action to take next. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between the granularity of image sequences and the decomposition of textual steps. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.
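The three stages named in the abstract (retrieve, refine, generate) can be illustrated with a toy, text-only sketch. This is not the authors' implementation: a plain-text state description stands in for the uploaded image, bag-of-words cosine similarity stands in for VLM-based cross-modal retrieval, and a punctuation-based splitter stands in for semantic decomposition. All function names, the `PROCEDURES` dictionary, and the example data are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase alphabetic tokens; a crude stand-in for a learned encoder."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def retrieve_procedure(state, procedures):
    """Stage 1 (assumed form): pick the procedure most similar to the observed state."""
    return max(procedures, key=lambda name: cosine(state, " ".join(procedures[name])))


def refine_steps(steps):
    """Stage 2 (assumed form): split compound steps into finer-grained sub-steps."""
    refined = []
    for step in steps:
        for part in re.split(r",\s*then\s+|;\s*", step):
            part = part.strip().rstrip(".")
            if part:
                refined.append(part)
    return refined


def next_step(state, refined):
    """Stage 3 (assumed form): locate the sub-step matching the state, return the one after it."""
    idx = max(range(len(refined)), key=lambda i: cosine(state, refined[i]))
    return refined[idx + 1] if idx + 1 < len(refined) else None


def chain_of_procedure(state, procedures):
    """Run the three stages in order and return (procedure name, predicted next step)."""
    name = retrieve_procedure(state, procedures)
    return name, next_step(state, refine_steps(procedures[name]))


# Hypothetical example data, not taken from ProcedureVQA.
PROCEDURES = {
    "brew pour-over coffee": [
        "Boil water, then place a paper filter in the dripper.",
        "Add ground coffee to the filter; pour hot water in slow circles.",
        "Remove the dripper and serve.",
    ],
    "replace a bike tube": [
        "Remove the wheel, then pry the tire off the rim.",
        "Insert the new tube; inflate the tire.",
    ],
}

# Text description standing in for an intermediate-state image.
state = "ground coffee added to the filter"
result = chain_of_procedure(state, PROCEDURES)
print(result)  # ('brew pour-over coffee', 'pour hot water in slow circles')
```

The refinement stage matters in this example because the observed state falls in the middle of the second written step: without splitting that step into two sub-steps, the sketch would skip "pour hot water in slow circles" and jump to the third step. This mirrors the granularity mismatch between images and textual steps that the abstract describes, though only in a simplified form.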

Problem

Research questions and friction points this paper is trying to address.

visual procedure question answering

vision-language models

procedural reasoning

multimodal benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Procedure

visual procedural reasoning

vision-language models

multimodal benchmark