🤖 AI Summary
Existing benchmarks for 6D object pose estimation fail to capture the challenges posed by extreme conditions in first-person viewpoints—such as severe motion blur, dynamic lighting, and visual occlusions—leading to poor generalization of models in real-world scenarios. To address this gap, this work presents EgoXtreme, the first large-scale dataset systematically constructed from first-person perspectives using smart glasses across three demanding real-world domains: industrial maintenance, sports, and emergency response. Experimental evaluation reveals that state-of-the-art methods suffer significant performance degradation under low-light, motion-blurred, and smoke-obscured conditions. While image restoration alone proves ineffective, approaches leveraging temporal information through tracking demonstrate superior robustness. This study establishes a critical benchmark and provides empirical insights for advancing robust 6D pose estimation in challenging first-person settings.
📝 Abstract
Smart glasses are emerging as a useful device, providing rich contextual information in hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation from the egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world applications. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold under extreme conditions, especially in low light. We further demonstrate that simply applying image restoration (e.g., deblurring) yields no improvement under these conditions. In contrast, tracking-based approaches show performance gains, suggesting that exploiting temporal information is valuable in fast-motion scenarios. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/