Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing HOI detection benchmarks (e.g., HICO-DET) rely on single-annotation matching, which is incompatible with the inherently ambiguous, generative outputs of vision-language models (VLMs), leading to erroneous rejection of semantically valid predictions. Method: We propose a novel HOI benchmark enabling unified evaluation of both generative and discriminative models. It introduces a multiple-answer, multiple-choice evaluation protocol that accounts for semantic plausibility and the generative nature of VLM outputs: each question pairs the annotated positive instances with negatives curated to exclude ambiguous alternatives. We further design a multi-answer matching mechanism aligned with VLM output distributions, so that valid alternative interpretations are not penalized merely because only one answer was annotated. Contribution/Results: Our benchmark enables fair, fine-grained assessment of VLMs' true HOI comprehension capabilities—revealing their current limitations—and establishes a more inclusive, semantically grounded evaluation paradigm for HOI detection.

📝 Abstract
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess a strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect under exact matching. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs versus specialized HOI methods for interaction detection
Addressing misalignment between generative VLMs and exact-match HOI benchmarks
Reducing penalization of valid predictions in ambiguous HOI scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing a multiple-choice benchmark for HOI detection
Replacing exact matching with multiple valid interpretations
Enabling direct comparison between VLMs and HOI methods
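The scoring idea behind the protocol can be illustrated with a minimal sketch. The function below is a hypothetical implementation, not the paper's actual metric: it assumes each question carries a set of annotated positives and a curated negative set, and that any selected option outside both sets (an ambiguous alternative deliberately excluded from the negatives, such as "throwing" when "catching" is annotated) is simply ignored rather than counted as an error.

```python
def score_question(selected, positives, negatives):
    """Score one multiple-answer multiple-choice HOI question.

    Options that appear in neither `positives` nor `negatives`
    (ambiguous alternatives excluded from the question) are
    ignored rather than penalized.
    """
    selected = set(selected)
    tp = len(selected & positives)   # correctly chosen positives
    fp = len(selected & negatives)   # curated negatives that were chosen
    fn = len(positives - selected)   # annotated positives that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "catching" is annotated; "throwing" is a plausible alternative that
# was excluded from the negatives, so selecting it costs nothing.
p, r, f1 = score_question(
    selected={"catching", "throwing"},
    positives={"catching"},
    negatives={"holding", "repairing"},
)
```

Under exact matching the same prediction would be scored as half wrong; here the ambiguous "throwing" option falls outside both answer sets and leaves precision, recall, and F1 at 1.0.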