🤖 AI Summary
Existing HOI detection benchmarks (e.g., HICO-DET) rely on single-annotation matching, which is incompatible with the inherently ambiguous, generative outputs of vision-language models (VLMs), leading to erroneous rejection of semantically valid predictions. Method: We propose a novel HOI benchmark enabling unified evaluation of both generative and discriminative models. It introduces a multiple-answer, multiple-choice evaluation protocol integrating semantic plausibility and generative characteristics, comprising authentic positive instances and disambiguated negative examples. We further design a multi-answer matching mechanism aligned with VLM output distributions to avoid penalizing valid alternatives due to annotation uniqueness. Contribution/Results: Our benchmark enables fair, fine-grained assessment of VLMs’ true HOI comprehension capabilities—revealing their current limitations—and establishes a more inclusive, semantically grounded evaluation paradigm for HOI detection.
📝 Abstract
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess a strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose, standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other interpretation, though equally plausible for the image, is marked incorrect under exact matching. As a result, correct predictions can be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
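The multiple-answer multiple-choice protocol described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's actual implementation: it assumes each question provides a set of ground-truth positive options and a curated set of negatives, and that ambiguous alternatives (like "throwing" when only "catching" is annotated) are simply excluded from the choice set rather than counted as errors. All function and variable names here are illustrative.

```python
def score_question(predicted: set, positives: set, negatives: set) -> dict:
    """Score a model's selected options for one multiple-choice question.

    Only options that appear in the curated choice set (positives union
    negatives) are evaluated; predictions outside it are ignored, so a
    plausible-but-unannotated interaction is never penalized.
    """
    choices = positives | negatives
    selected = predicted & choices          # drop out-of-choice-set answers
    tp = len(selected & positives)          # correctly chosen positives
    fp = len(selected & negatives)          # wrongly chosen negatives
    fn = len(positives - selected)          # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Example from the abstract's frisbee scenario: "throwing" is neither a
# positive nor a curated negative, so predicting it costs nothing.
result = score_question(
    predicted={"catching", "throwing"},
    positives={"catching"},
    negatives={"kicking", "riding"},
)
# result["f1"] is 1.0: the ambiguous "throwing" prediction is ignored.
```

Under exact matching, the same prediction set would be penalized for "throwing"; here the curated negatives carry all of the discriminative burden, which is the core design choice of the proposed protocol.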