🤖 AI Summary
Procedural video instructions often omit implicit semantic arguments (e.g., "what" and "where/with"), leading to incomplete reasoning. Method: This paper introduces Implicit-VidSRL—the first implicit semantic role labeling dataset tailored for video procedural text—and proposes iSRL-Qwen2-VL, a multimodal architecture built upon Qwen2-VL. It integrates video frame sequence modeling, cross-modal alignment, and verb-driven context-aware decoding. Contribution/Results: The paper conducts the first systematic evaluation of multimodal large language models on entity visual tracking and cross-step implicit reasoning. Experiments show that iSRL-Qwen2-VL achieves relative F1 improvements of 17.0% and 14.7% over GPT-4o on "what" and "where/with" implicit argument prediction, respectively, significantly enhancing fine-grained semantic parsing of multi-step cooking instructions.
📝 Abstract
Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structures such as {verb, what, where/with}. Procedural instructions are highly elliptic: for instance, given (i) "add cucumber to the bowl" and (ii) "add sliced tomatoes," the second step's where argument must be inferred from context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multimodal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% relative improvement for where/with-implicit semantic roles over GPT-4o.
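The cucumber/tomato example above can be made concrete as data. The sketch below represents each step as a {verb, what, where/with} frame and fills an elided where/with argument via a simple recency heuristic; the `Frame` class and `resolve_implicit` function are illustrative assumptions, not the paper's model, which reasons over video as well as text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    """One procedural step as a predicate-argument structure."""
    verb: str
    what: Optional[str]        # entity acted on; None if elided
    where_with: Optional[str]  # location/instrument; None if elided

def resolve_implicit(frames: List[Frame]) -> List[Frame]:
    """Fill elided where/with arguments from the most recent step
    that stated one (a naive text-only heuristic for illustration)."""
    last_location: Optional[str] = None
    resolved = []
    for f in frames:
        where = f.where_with if f.where_with is not None else last_location
        if f.where_with is not None:
            last_location = f.where_with
        resolved.append(Frame(f.verb, f.what, where))
    return resolved

steps = [
    Frame("add", "cucumber", "bowl"),
    Frame("add", "sliced tomatoes", None),  # where/with is implicit
]
print(resolve_implicit(steps)[1].where_with)  # → bowl
```

A pure recency heuristic like this breaks down once entities move between containers, which is why the benchmark requires tracking entities through visual state changes rather than text alone.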