🤖 AI Summary
This work addresses zero-shot compositional action recognition, where models often fail to generalize to unseen verb-object combinations because they rely on object-driven verb shortcuts. The authors show that this limitation stems from existing methods overfitting to co-occurrence statistics in the training data while neglecting temporal visual evidence. To mitigate it, they propose the RCORE framework, which combines composition-aware data augmentation with a temporal order regularization loss, encouraging the model to learn verb semantics from temporally aligned visual context and reducing its dependence on co-occurrence bias. Evaluated on the Sth-com benchmark and a newly constructed EK100-com dataset, RCORE significantly improves recognition accuracy on unseen compositions and consistently yields positive compositional generalization gains.
📝 Abstract
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: the severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, existing ZS-CAR models increasingly ignore visual evidence and overfit to co-occurrence statistics; consequently, they gain no benefit from composition on unseen verb-object combinations. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behavior by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
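The abstract does not specify the form of the temporal order regularization loss. As a rough illustration of the general idea behind such losses (an ordered clip should score higher than a shuffled copy of itself, otherwise the model is likely ignoring temporal structure), here is a minimal NumPy sketch. The function names, the directional-consistency score, and the hinge formulation are all assumptions for exposition, not the authors' actual method.

```python
import numpy as np

def order_score(frames):
    """Directional consistency of a clip: mean cosine similarity between
    consecutive frame-to-frame feature deltas. Near 1 for smooth forward
    motion, lower when the frame order is scrambled.

    frames: (T, D) array of per-frame features (toy stand-in; the real
    model would use learned video features)."""
    deltas = np.diff(frames, axis=0)
    deltas = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(deltas[1:] * deltas[:-1], axis=1)))

def temporal_order_reg(frames, margin=0.5, seed=0):
    """Hinge-style regularizer: penalize the model unless the ordered clip
    out-scores a shuffled copy of itself by `margin`. A model that relies
    on object appearance alone scores both versions alike and pays the
    full penalty."""
    rng = np.random.default_rng(seed)
    shuffled = frames[rng.permutation(len(frames))]
    return max(0.0, margin - (order_score(frames) - order_score(shuffled)))
```

For example, on a toy clip whose features drift monotonically over time, `order_score` is close to 1, while manually interleaving the frames (e.g. index order `[0, 2, 1, 4, 3, ...]`) makes consecutive deltas alternate direction and drives the score down, so the hinge activates only when order and shuffle become indistinguishable.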