RECIPE: Procedural Planning via Grounding in Instructional Video

📅 2026-05-19

🤖 AI Summary
Existing visual planning methods generalize poorly because annotated data is scarce, covers narrow domains, and records only one execution trajectory per example. This work instead uses large-scale, noisy instructional videos as a low-cost verifier rather than a source of labels: the reward is the quality of the temporal alignment between a generated step sequence and the ASR transcript of the video. The planner is optimized with GRPO reinforcement learning, exploiting the asymmetry between cheap verification and hard label extraction to obtain scalable rewards from unlabeled video without manual annotation. The approach supports both a text-history (Socratic) input and a direct video-token input. Across seven benchmarks it improves macro accuracy over the base checkpoint by 7 to 8 points in-domain and by up to 16 points zero-shot, outperforms supervised fine-tuning, and preserves output diversity.
📝 Abstract
Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.
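The abstract describes the reward only at a high level (temporal grounding of generated steps in ASR transcripts via precomputed text embeddings, used as a GRPO reward). The following is a minimal sketch of how such a signal could be computed, not the authors' implementation. The function names `grounding_reward` and `group_advantages`, the use of cosine similarity, and the order-preserving dynamic-programming alignment are all assumptions made for illustration. The group-relative normalization is the standard GRPO advantage (reward minus group mean, divided by group standard deviation).

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def grounding_reward(step_embs, segment_embs):
    """Hypothetical grounding reward (an assumption, not the paper's formula).

    step_embs:    embeddings of generated steps, in generated order.
    segment_embs: precomputed embeddings of ASR segments, in temporal order.

    Each step is assigned to one segment so that segment indices never
    decrease along the plan; the reward is the mean similarity of the best
    such order-preserving assignment.
    """
    if not step_embs or not segment_embs:
        return 0.0
    # prev[j]: best total similarity so far with the last step placed at segment j.
    prev = [0.0] * len(segment_embs)
    for step in step_embs:
        cur = []
        best_prefix = -math.inf
        for j, seg in enumerate(segment_embs):
            # The previous step may sit at any segment k <= j.
            best_prefix = max(best_prefix, prev[j])
            cur.append(best_prefix + cosine(step, seg))
        prev = cur
    return max(prev) / len(step_embs)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: two ASR segments and two candidate plans for the same video.
segments = [[1.0, 0.0], [0.0, 1.0]]
in_order = [[1.0, 0.0], [0.0, 1.0]]   # steps follow the narration order
reversed_plan = [[0.0, 1.0], [1.0, 0.0]]  # same steps, wrong order
rewards = [grounding_reward(in_order, segments),
           grounding_reward(reversed_plan, segments)]
advantages = group_advantages(rewards)
```

In this toy case the in-order plan scores 1.0 and the reversed plan 0.5, so the in-order plan receives the positive advantage. Because the reward depends only on alignment quality and not on matching a single reference plan, several different orderings can score well, which is consistent with the abstract's point about moving beyond single-trajectory supervision.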
Problem

Research questions and friction points this paper is trying to address.

visual planning
procedural planning
instructional video
noisy ASR
execution trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural planning
instructional video grounding
reinforcement learning from verification
weakly supervised learning
temporal alignment