🤖 AI Summary
HOI detection exhibits insufficient robustness under distribution shifts: existing methods largely assume ideal data distributions and therefore generalize poorly to real-world conditions. To address this, we introduce Robust-HOI, the first automatically constructed benchmark for evaluating robustness in HOI detection, and use it to systematically assess over 40 state-of-the-art methods, uncovering widespread cross-domain failure patterns. We propose a plug-and-play robust training framework that combines MixUp-based cross-domain data augmentation with a multimodal feature fusion mechanism built on a frozen vision foundation model, improving semantic alignment and domain invariance. Our approach substantially improves robustness across diverse distribution shifts (including domain, viewpoint, and style shifts) while also yielding gains on the standard HICO-DET and V-COCO benchmarks. To support reproducibility and community progress, we will open-source the benchmark suite, annotated datasets, evaluation tools, and implementation code.
📝 Abstract
Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and a natural distribution, far from practical scenarios where distribution shifts are inevitable. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. We then evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficient robustness, analyzing the characteristics of different frameworks, and discussing how robustness in HOI detection differs from that in other tasks. Guided by these analyses, we propose to improve the robustness of HOI detection methods through: (1) cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy using frozen vision foundation models. Both are simple, plug-and-play, and applicable to a wide range of methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with additional gains on standard benchmarks. The dataset and code will be released.
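The first ingredient, cross-domain data augmentation integrated with mixup, can be illustrated with a minimal sketch: an image is linearly blended with a style- or domain-shifted view of the same scene, using a mixing ratio drawn from a Beta distribution as in standard MixUp. Since both views depict the same scene, HOI labels can be reused. The function name, parameters, and label handling here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_domain_mixup(img_src, img_shifted, alpha=0.2, rng=None):
    """Blend an image with a domain-shifted version of the same scene.

    MixUp-style augmentation sketch (illustrative, not the paper's API):
    the mixing ratio `lam` is sampled from Beta(alpha, alpha), and the
    two aligned views are interpolated pixel-wise. Because the scene is
    unchanged, the HOI annotations can be kept as-is.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed = lam * img_src + (1.0 - lam) * img_shifted
    return mixed, lam

# toy usage: mix a "clean" stand-in image with a style-shifted copy
clean = np.ones((2, 2, 3), dtype=np.float32)
shifted = np.zeros((2, 2, 3), dtype=np.float32)
mixed, lam = cross_domain_mixup(clean, shifted)
assert mixed.shape == clean.shape and 0.0 <= lam <= 1.0
```

With a small `alpha`, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two domains while still exposing the model to intermediate appearances.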
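The second ingredient, feature fusion with a frozen vision foundation model, can be sketched at the shape level: features from the frozen backbone are treated as fixed, projected into the detector's feature space by a small trainable projection, and added residually to the detector features. All names, shapes, and the additive fusion choice below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def fuse_frozen_features(det_feat, vfm_feat, W_proj):
    """Fuse detector features with frozen foundation-model features.

    Illustrative sketch: `vfm_feat` comes from a frozen vision foundation
    model (no gradients would flow into it during training); only the
    projection `W_proj` is trainable. The projected features are added
    to the detector features in a residual fashion.
    """
    # det_feat: (N, d_det) detector tokens; vfm_feat: (N, d_vfm) frozen tokens
    projected = vfm_feat @ W_proj   # (N, d_vfm) @ (d_vfm, d_det) -> (N, d_det)
    return det_feat + projected     # residual-style fusion

# toy usage: 3 tokens, detector dim 4, foundation-model dim 8
rng = np.random.default_rng(0)
det = rng.standard_normal((3, 4))
vfm = rng.standard_normal((3, 8))
W = rng.standard_normal((8, 4)) * 0.01
fused = fuse_frozen_features(det, vfm, W)
assert fused.shape == (3, 4)
```

Keeping the backbone frozen preserves the foundation model's broadly pretrained, domain-invariant representations, which is what makes this fusion attractive for robustness under distribution shift.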