Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos

📅 2024-02-16
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of determining, from video, when people are struggling during manual assembly tasks. To tackle this, the authors introduce the first struggle recognition dataset for hand-assembly videos—comprising 5.1 hours and 725,100 frames from 73 participants—covering plumbing-pipe assembly, tent pitching, and Tower of Hanoi solving. Video-level struggle is formally defined and annotated via a dual-track protocol (a single expert annotator plus crowdsourcing) using a forced-choice four-point rating scale, enabling a multi-task benchmark for struggle classification, struggle level regression, and struggle label distribution learning. Baselines built on mainstream spatiotemporal architectures (ResNet, ViT, and LSTM) are evaluated, with ablation studies and attention visualisations probing their ability to represent fine-grained struggle states. The results establish systematic baselines and point toward intelligent instructional assistance and adaptive human–machine interaction.

📝 Abstract
Determining when people are struggling from video enables a finer-grained understanding of actions and opens opportunities for building intelligent visual support interfaces. In this paper, we present a new dataset with three assembly activities and corresponding performance baselines for the determination of struggle from video. Three real-world problem-solving activities including assembling plumbing pipes (Pipes-Struggle), pitching camping tents (Tent-Struggle) and solving the Tower of Hanoi puzzle (Tower-Struggle) are introduced. Video segments were scored w.r.t. the level of struggle as perceived by annotators using a forced choice 4-point scale. Each video segment was annotated by a single expert annotator in addition to crowd-sourced annotations. The dataset is the first struggle annotation dataset and contains 5.1 hours of video and 725,100 frames from 73 participants in total. We evaluate three decision-making tasks: struggle classification, struggle level regression, and struggle label distribution learning. We provide baseline results for each of the tasks utilising several mainstream deep neural networks, along with an ablation study and visualisation of results. Our work is motivated toward assistive systems that analyze struggle, support users during manual activities and encourage learning, as well as other video understanding competencies.
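One of the three benchmark tasks, struggle label distribution learning, can be illustrated with a small sketch: the crowd-sourced forced-choice 4-point ratings for a segment are normalised into a distribution over struggle levels, and a model's predicted distribution is scored with KL divergence. The rating counts, model output, and use of KL as the loss below are all illustrative assumptions, not the paper's exact setup.

```python
# Sketch: struggle label distribution learning on a 4-point forced-choice scale.
# Rating counts and model output are hypothetical; KL divergence is used here
# as a common LDL loss, not necessarily the paper's exact objective.
import math

def ratings_to_distribution(counts):
    """Normalise crowd rating counts (struggle levels 1-4) into a distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); eps guards against log(0)."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

# Hypothetical: 10 annotators rated one video segment on the 4-point scale.
counts = [1, 2, 5, 2]                      # votes for struggle levels 1..4
target = ratings_to_distribution(counts)   # -> [0.1, 0.2, 0.5, 0.2]

predicted = [0.15, 0.25, 0.45, 0.15]       # hypothetical model output
loss = kl_divergence(target, predicted)
print(target)
print(loss)
```

The same crowd ratings also yield targets for the other two tasks: thresholding or majority vote gives a binary struggle label for classification, and the mean rating gives a scalar target for struggle level regression.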
Problem

Research questions and friction points this paper is trying to address.

Determine when people are struggling during real-world manual assembly activities directly from video.
Address the absence of any existing dataset with annotated struggle behaviour for fine-grained action understanding.
Establish baselines for struggle classification, struggle level regression, and struggle label distribution learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First struggle annotation dataset: 5.1 hours, 725,100 frames, 73 participants, three assembly activities
Dual-track annotation (expert + crowd-sourced) on a forced-choice 4-point struggle scale
Three-task benchmark with mainstream deep network baselines, ablation study, and result visualisation
Shijia Feng
School of Computer Science, University of Bristol, Bristol, UK
Michael Wray
Lecturer, University of Bristol
Computer Vision
Brian Sullivan
Bristol Medical School (PHS), University of Bristol, Bristol, UK
Youngkyoon Jang
Senior Research Scientist, Huawei Noah’s Ark Lab
Computer Vision · Augmented Reality · Virtual Reality · Affective Computing · HRI
Casimir J. H. Ludwig
School of Psychological Science, University of Bristol, Bristol, UK
Iain Gilchrist
School of Psychological Science, University of Bristol, Bristol, UK
Walterio W. Mayol-Cuevas
School of Computer Science, University of Bristol, Bristol, UK; Amazon, Seattle, WA, United States