Stitch-a-Recipe: Video Demonstration from Multistep Descriptions

📅 2025-03-18
🤖 AI Summary
Problem: Generating temporally coherent videos from multi-step textual instructions (e.g., recipes) remains challenging, as existing methods rely on sentence-level alignment and treat steps in isolation, leading to temporal discontinuities.
Method: We propose the first retrieval-based multi-step video stitching framework. Our approach constructs a large-scale, weakly supervised recipe-video dataset; introduces multi-granularity cross-modal contrastive retrieval to achieve inter-step semantic alignment; jointly optimizes generation fidelity and visual coherence via hard negative mining; and ensures temporal consistency through clip re-ranking and smooth splicing.
Contribution/Results: Evaluated on real-world instructional videos, our method achieves state-of-the-art performance, with key metrics improving by up to 24% over prior work. Human evaluation confirms significantly higher preference scores compared to all baselines.

📝 Abstract
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g., a cooking recipe composed of multiple steps. Furthermore, simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Recipe, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse and novel recipes and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains of up to 24% as well as dramatic wins in a human preference study.
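
To make the trade-off between per-step correctness and cross-step coherence concrete, here is a minimal sketch of one way a clip sequence could be chosen. It is not the paper's algorithm: the function name `stitch`, the precomputed score matrices `correctness` and `coherence`, and the parameter `coherence_weight` are all assumptions introduced for illustration. The sketch uses a standard Viterbi-style dynamic program to pick one clip per step.

```python
from typing import List


def stitch(correctness: List[List[float]],
           coherence: List[List[float]],
           coherence_weight: float = 0.5) -> List[int]:
    """Pick one clip index per step (hypothetical helper, not from the paper).

    correctness[t][c]: how well clip c matches step t (assumed precomputed).
    coherence[p][c]:   how well clip c visually follows clip p (assumed).
    Maximizes the sum of correctness scores plus weighted coherence
    between consecutive clips.
    """
    n_steps = len(correctness)
    if n_steps == 0:
        return []
    n_clips = len(correctness[0])

    best = [list(correctness[0])]   # best[t][c]: best score ending in clip c
    back: List[List[int]] = []      # back-pointers for path recovery

    for t in range(1, n_steps):
        row, ptr = [], []
        for c in range(n_clips):
            # Best predecessor for clip c at step t.
            prev = max(
                range(n_clips),
                key=lambda p: best[-1][p] + coherence_weight * coherence[p][c],
            )
            score = (best[-1][prev]
                     + coherence_weight * coherence[prev][c]
                     + correctness[t][c])
            row.append(score)
            ptr.append(prev)
        best.append(row)
        back.append(ptr)

    # Trace the best path backwards from the best final clip.
    path = [max(range(n_clips), key=lambda c: best[-1][c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy scores: clips 0 and 1 match their steps best but do not cohere;
# clips 2 and 3 match slightly worse but come from a coherent pair.
correctness = [[0.9, 0.0, 0.8, 0.0],
               [0.0, 0.9, 0.0, 0.8]]
coherence = [[0.0] * 4 for _ in range(4)]
coherence[2][3] = 1.0

isolated = stitch(correctness, coherence, coherence_weight=0.0)
coherent = stitch(correctness, coherence, coherence_weight=0.5)
```

With `coherence_weight=0.0` each step is handled in isolation, which is the failure mode the abstract warns about; a positive weight lets a slightly weaker but visually consistent pair of clips win.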
Problem

Research questions and friction points this paper is trying to address.

Visual illustration of multistep text descriptions
Coherent video assembly from diverse step descriptions
Retrieval-based method for accurate and coherent video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-based video assembly from multistep descriptions
Training pipeline with weakly supervised diverse recipes
Injection of hard negatives for coherence and correctness
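
The idea of hard negatives that target correctness and coherence separately can be illustrated with a small sketch. The summary does not say how the negatives are constructed, so everything below is an assumption made for illustration: the `Clip` record, the two helper functions, and the specific corruption rules (reusing a clip from the wrong step, or swapping in a clip from a different source video).

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clip:
    video_id: str  # source video the clip was cut from (hypothetical field)
    step: int      # index of the recipe step the clip shows (hypothetical)


def correctness_negative(positive: List[Clip], rng: random.Random) -> List[Clip]:
    """Assumed rule: overwrite one position with a clip from another step.

    The result stays within the same source video, so it still looks
    coherent, but one step is now illustrated by the wrong action.
    """
    if len(positive) < 2:
        raise ValueError("need at least two clips")
    i, j = rng.sample(range(len(positive)), 2)
    negative = list(positive)
    negative[i] = positive[j]
    return negative


def coherence_negative(positive: List[Clip], pool: List[Clip],
                       rng: random.Random) -> List[Clip]:
    """Assumed rule: replace one clip with a clip from a different video."""
    sources = {clip.video_id for clip in positive}
    foreign = [clip for clip in pool if clip.video_id not in sources]
    if not foreign:
        raise ValueError("pool has no clip from another video")
    negative = list(positive)
    negative[rng.randrange(len(positive))] = rng.choice(foreign)
    return negative


rng = random.Random(0)
positive = [Clip("a", 0), Clip("a", 1), Clip("a", 2)]
pool = positive + [Clip("b", 0), Clip("b", 1)]

wrong_step = correctness_negative(positive, rng)
wrong_look = coherence_negative(positive, pool, rng)
```

A model trained to rank `positive` above both corrupted sequences would need to attend to what each clip shows as well as to whether the clips belong together, which is the stated purpose of the hard negatives.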