Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

๐Ÿ“… 2026-03-09
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study investigates the origins of the performance gap between humans and AI in egocentric action recognition, particularly under spatial reduction and temporal disruption. Through large-scale comparative experiments, it systematically combines spatial cropping with temporal perturbation for the first time, introducing the concept of the “Minimal Identifiable Recognition Crop” (MIRC) and a novel LTA/HTA action taxonomy. Leveraging the Epic ReduAct dataset, MIRC annotations, and Recognition Gap metrics, the work reveals that human performance declines sharply in the transition from MIRCs to sub-MIRC regions, indicating a strong reliance on sparse semantic cues such as hand-object interactions. In contrast, the Side4Video model degrades more gradually, relying predominantly on contextual and low-level visual features; it is often insensitive to temporal disruption overall, while displaying category-dependent temporal sensitivity.
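To make the MIRC notion concrete: a MIRC is a crop that humans still recognise reliably, while every one-step reduction of it falls below the recognition threshold. The sketch below is an illustrative search over a crop-reduction hierarchy, not the authors' procedure; the toy hierarchy, recognition rates, and threshold value are all hypothetical.

```python
# Illustrative sketch (not the paper's procedure) of searching for MIRCs in
# a crop-reduction hierarchy: a MIRC is a crop recognised above threshold
# whose every one-step reduction falls below it. Hierarchy, rates, and
# threshold below are hypothetical.

THRESHOLD = 0.5  # hypothetical human recognition-rate threshold

# Toy tree: each crop maps to its one-step reductions (smaller / lower-res crops).
CHILDREN = {
    "full_frame": ["crop_hands", "crop_background"],
    "crop_hands": ["crop_hand_only", "crop_object_only"],
    "crop_background": [],
    "crop_hand_only": [],
    "crop_object_only": [],
}

# Toy measured human recognition rates per crop.
RATES = {
    "full_frame": 0.95,
    "crop_hands": 0.80,       # still recognisable
    "crop_background": 0.20,  # context alone is not enough
    "crop_hand_only": 0.30,   # one step further and recognition collapses
    "crop_object_only": 0.25,
}


def find_mircs(crop: str) -> list[str]:
    """Return crops recognised above THRESHOLD whose reductions all fall below it."""
    if RATES[crop] < THRESHOLD:
        return []  # already a sub-MIRC: stop descending
    recognisable = [c for c in CHILDREN[crop] if RATES[c] >= THRESHOLD]
    if not recognisable:
        return [crop]  # every further reduction fails -> candidate MIRC
    mircs = []
    for child in recognisable:
        mircs.extend(find_mircs(child))
    return mircs


print(find_mircs("full_frame"))  # -> ['crop_hands']
```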

๐Ÿ“ Abstract
Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct dataset, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC-KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics (Average Reduction Rate and Recognition Gap) with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance declines sharply in the transition from MIRCs to sub-MIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually, often relies on contextual and mid- to low-level features, and sometimes even exhibits increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.
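The Recognition Gap quantifies how sharp the MIRC-to-sub-MIRC transition is. The minimal sketch below assumes the common definition from the original MIRC line of work (Ullman et al.): the drop from a MIRC's recognition rate to the highest rate among its one-step reductions, averaged over MIRCs. The paper's exact formulation may differ, and the action names and rates below are hypothetical.

```python
# Minimal sketch of the Recognition Gap, assuming the common definition:
# the drop from a MIRC's recognition rate to the highest recognition rate
# among its one-step-reduced sub-MIRCs, averaged over all MIRCs.
# The data structure and numbers below are hypothetical.

mircs = {
    # mirc_id: (mirc_recognition_rate, [sub-MIRC recognition rates])
    "cut_vegetable_01": (0.88, [0.21, 0.15, 0.09]),
    "pour_water_04":    (0.79, [0.33, 0.12]),
}


def recognition_gap(mirc_rate: float, sub_rates: list[float]) -> float:
    """Drop from the MIRC to its best-recognised one-step reduction."""
    return mirc_rate - max(sub_rates)


gaps = [recognition_gap(rate, subs) for rate, subs in mircs.values()]
avg_gap = sum(gaps) / len(gaps)
print(f"Average Recognition Gap: {avg_gap:.2f}")
```

A large average gap (as reported for humans) indicates an all-or-nothing reliance on a few critical cues, whereas a small gap (as reported for the model) indicates graceful degradation over contextual features.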
Problem

Research questions and friction points this paper is trying to address.

Human-AI divergence
egocentric action recognition
spatial manipulation
spatiotemporal manipulation
performance gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

MIRCs
egocentric action recognition
human-AI divergence
spatiotemporal manipulation
temporal sensitivity
๐Ÿ”Ž Similar Papers
No similar papers found.
Sadegh Rahmaniboldaji
University of Surrey, Guildford, UK
Filip Rybansky
Newcastle University, Newcastle upon Tyne, UK
Quoc C. Vuong
Newcastle University, Newcastle upon Tyne, UK
Anya C. Hurlbert
Newcastle University, Newcastle upon Tyne, UK
Frank Guerin
University of Surrey
Natural Language Processing, Computer Vision, Robotics, Artificial Intelligence
Andrew Gilbert
University of Surrey
Machine Learning, Video Understanding, Computer Vision