🤖 AI Summary
This work addresses the challenges of structured understanding and cross-domain generalization for long-horizon, multimodal procedural videos, exemplified by extravehicular activities (EVAs) aboard the International Space Station. We introduce the first benchmark tailored to real-world space operations, featuring long-duration, multimodal procedural video understanding with two core tasks: step recognition and video question answering. To enable zero-shot or few-shot domain adaptation without fine-tuning, we propose a summary-guided adaptation method that integrates lightweight multimodal fusion (vision and speech), temporal action segmentation, and summary-informed reasoning. Experiments reveal substantial performance gaps in existing models on cross-domain long-video understanding; our approach achieves up to a 14.2% absolute accuracy improvement without any fine-tuning. The benchmark is publicly released, establishing a new standard for multimodal procedural video understanding in aerospace applications.
📝 Abstract
Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain a structured understanding of a demonstration, such as its temporal segmentation into sequences of actions and skills, and to generalize that understanding to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.