In-Video Instructions: Visual Signals as Generative Control

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the global, ambiguous, and spatially imprecise nature of text-based prompting for image-to-video generation, this work introduces *In-Video Instruction*, a paradigm that embeds structured visual signals (e.g., overlaid text, arrows, and motion trajectories) directly into the input frame, enabling pixel-level spatial alignment and unambiguous, fine-grained control. Because the guidance lives in the visual domain rather than in a text prompt, state-of-the-art video diffusion models, including Veo 3.1, Kling 2.5, and Wan 2.2, can interpret and execute spatially grounded, multi-object, multi-action instructions off the shelf, without fine-tuning or architectural changes. Experiments show substantial gains in instruction adherence and action localization accuracy, particularly in complex multi-object scenarios. The work provides the first systematic validation that existing video generation models can reliably parse embedded visual instructions, establishing a scalable, high-precision pathway for controllable video synthesis.
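
The core mechanic is simple enough to sketch. Below is a minimal, hypothetical example of rendering one instruction (an arrow plus an overlaid text label) onto a conditioning frame with Pillow before handing it to an image-to-video model; the file names, pixel coordinates, and instruction wording are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the In-Video Instruction idea: render an instruction
# (overlaid text plus an arrow) directly onto the conditioning frame before
# passing it to an off-the-shelf image-to-video model. All file names,
# coordinates, and wording here are hypothetical.
from PIL import Image, ImageDraw

frame = Image.open("input_frame.png").convert("RGB")
draw = ImageDraw.Draw(frame)

# Arrow from the subject toward its intended target location,
# in user-chosen pixel coordinates (hard-coded for illustration).
start, end = (220, 340), (460, 180)
draw.line([start, end], fill=(255, 0, 0), width=6)
# Simple arrowhead: a filled triangle around the end point.
draw.polygon([(460, 180), (433, 191), (442, 207)], fill=(255, 0, 0))

# Overlaid text instruction anchored near the subject.
draw.text((start[0] - 40, start[1] + 16), "walk to the door", fill=(255, 0, 0))

# The annotated frame now carries the instruction in pixel space and is
# consumed by the generator exactly like any other conditioning image.
frame.save("instructed_frame.png")
```

Any drawing toolkit would do; the only requirement is that the signal is legible in pixel space, since the generator sees the annotated frame as an ordinary input image.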

📝 Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
Problem

Research questions and friction points this paper is trying to address.

Control image-to-video generation using visual signals
Enable spatial-aware instructions for multiple objects
Interpret embedded visual cues such as arrows, text, and trajectories (see the trajectory sketch after this list)
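
Trajectories are the most spatially dense of these cues. As a sketch of the idea, assuming hypothetical waypoint coordinates and file names, a motion path can be traced as a polyline over the conditioning frame:

```python
# A minimal sketch, under assumed coordinates, of encoding a motion
# trajectory as an in-frame visual signal: a polyline traced over the
# conditioning image marks the path the subject should follow.
from PIL import Image, ImageDraw

frame = Image.open("input_frame.png").convert("RGB")
draw = ImageDraw.Draw(frame)

# Hypothetical waypoints of the desired motion path, in pixel coordinates.
path = [(100, 420), (180, 360), (290, 330), (400, 280), (470, 200)]
draw.line(path, fill=(0, 200, 0), width=5)

# Dot at the final waypoint so the terminal position is unambiguous.
x, y = path[-1]
draw.ellipse([x - 7, y - 7, x + 7, y + 7], fill=(0, 200, 0))

frame.save("trajectory_instructed_frame.png")
```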
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual signals embedded in frames as instructions
Spatial-aware control through overlaid text and arrows
Assigning distinct visual instructions to different objects (see the multi-object sketch below)
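
For the multi-object case, one plausible encoding (a hypothetical sketch, not the paper's exact rendering scheme) is to give each subject its own color-coded arrow and text label, so the object-action correspondence stays unambiguous in pixel space; all coordinates, labels, and colors below are made up for illustration.

```python
# Hypothetical per-object instruction assignment: each subject gets its own
# color-coded arrow and label, keeping the mapping from object to action
# explicit in pixel space. Coordinates, labels, and colors are illustrative.
from PIL import Image, ImageDraw

instructions = [
    # (text, arrow start, arrow end, color)
    ("jump over the fence", (120, 400), (260, 240), (255, 0, 0)),
    ("turn left",           (520, 380), (400, 380), (0, 128, 255)),
]

frame = Image.open("input_frame.png").convert("RGB")
draw = ImageDraw.Draw(frame)

for text, start, end, color in instructions:
    draw.line([start, end], fill=color, width=5)
    # Dot marks the arrow's target location.
    draw.ellipse([end[0] - 8, end[1] - 8, end[0] + 8, end[1] + 8], fill=color)
    # Label anchored near the subject it addresses.
    draw.text((start[0], start[1] + 12), text, fill=color)

frame.save("multi_object_instructed_frame.png")
```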