Substantial, Decomposable, and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how misalignment between instructional videos and users’ real-world visual contexts harms physical task performance. The authors prepare In-Context instructional videos (ICON) from Wizard-of-Oz recordings, which lets them control alignment across four decomposable visual attributes. Fully aligned ICON videos yield 11.1% higher completion quality and 15.5% faster completion than ordinary Internet videos, whereas misaligning any single attribute consistently degrades performance. Crucially, participants do not perceive these costs despite clear drops in objective measures. The work quantifies the visual context misalignment effect, shows that it is invisible to users, and informs how instructional videos for physical task guidance should be evaluated.

📝 Abstract

Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON), fully aligned with the user's visual perception, to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time in an otherwise fully aligned video, confirming that all four produce consistent degradation. However, we find that users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings clarify the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.

Problem

Research questions and friction points this paper is trying to address.

visual context misalignment

instructional videos

physical tasks

task completion

user perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual context alignment

instructional video

physical task learning