🤖 AI Summary
Open-vocabulary video visual relation detection suffers from error propagation in cascaded pipelines, where mistakes in object detection carry over into relation classification. This paper proposes the first query-driven unified framework that jointly models object detection and relation classification, enabling their co-optimization through a bidirectional mutual enhancement mechanism. Key contributions include: (1) a CLIP-based contextual refinement encoding module that improves generalization to unseen object and relation categories; and (2) an iterative mutual enhancement module that explicitly captures the semantic interdependence between objects and relations. The method combines vision-language alignment, a query-based Transformer architecture, context-aware feature refinement, and iterative representation enhancement. Evaluated on VidVRD and VidOR, it achieves state-of-the-art performance, substantially improving joint recognition accuracy for novel-category objects and relations in open-vocabulary settings.
📝 Abstract
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternately enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance.
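The core idea of the iterative enhancement module, alternately refining one set of representations using the other, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the use of plain scaled dot-product cross-attention with a residual update, and the fixed iteration count are all illustrative assumptions standing in for the learned Transformer layers the paper describes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys):
    # Scaled dot-product cross-attention with a residual update:
    # each query row is enriched by a weighted mix of the key rows.
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d))
    return queries + attn @ keys

def iterative_mutual_enhancement(obj_feats, rel_feats, n_iters=3):
    # Alternate the two directions of enhancement (illustrative of the
    # paper's mutual-enhancement idea, not its actual layers):
    # relations attend over objects, then objects attend over the
    # updated relations, for a fixed number of rounds.
    for _ in range(n_iters):
        rel_feats = cross_attend(rel_feats, obj_feats)
        obj_feats = cross_attend(obj_feats, rel_feats)
    return obj_feats, rel_feats

rng = np.random.default_rng(0)
obj = rng.normal(size=(5, 16))    # 5 object queries, 16-dim features
rel = rng.normal(size=(10, 16))   # 10 relation queries, 16-dim features
obj_out, rel_out = iterative_mutual_enhancement(obj, rel)
```

The point of the alternation is that each round lets relation representations absorb object context and vice versa, so errors in one stream can be corrected by evidence from the other rather than propagating one-way as in a cascaded pipeline.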