AI Summary
This study addresses the challenge of step alignment between assembly diagrams and instructional videos caused by the depiction gap: the visual disparity in how objects and actions are represented across modalities. To systematically evaluate this issue, the authors introduce IKEA-Bench, a benchmark encompassing 29 IKEA furniture items, 1,623 questions, and six task categories, used to assess 19 vision-language models under three alignment strategies. Through a three-tier analytical framework, they quantitatively characterize cross-depiction alignment difficulties for the first time, identifying visual encoding as the primary bottleneck for robust performance. The analysis reveals that model architecture family is a stronger predictor of alignment accuracy than parameter count, and uncovers phenomena such as ViT subspace separation and text-guided reasoning bias. Notably, while textual input aids instruction comprehension, it degrades visual alignment, and video understanding cannot be substantially improved through alignment strategies alone.
Abstract
2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed-reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision-Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/