Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot compositional action recognition, where models often fail to generalize to unseen verb-object combinations because they rely on object-driven verb shortcuts. The authors trace this failure to two factors: sparse, skewed compositional supervision and the asymmetric learning difficulty between verbs and objects. Together these lead existing models to overfit to co-occurrence statistics in the training data while ignoring visual evidence. To mitigate this, they propose the RCORE framework, which combines composition-aware data augmentation with a temporal order regularization loss to enforce temporally grounded verb learning. Evaluated on the Sth-com benchmark and a newly constructed EK100-com dataset, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps.
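
The object-driven verb shortcut described above can be illustrated with a toy example. The sketch below is not from the paper: the data and the helpers `build_shortcut` and `shortcut_accuracy` are hypothetical names chosen for illustration. It builds a predictor that picks each object's most frequent training verb without looking at the video, then scores it on seen and unseen compositions.

```python
from collections import Counter, defaultdict


def build_shortcut(train_pairs):
    """Map each object to the verb it co-occurs with most often in training."""
    counts = defaultdict(Counter)
    for verb, obj in train_pairs:
        counts[obj][verb] += 1
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}


def shortcut_accuracy(shortcut, pairs):
    """Verb accuracy of a predictor that looks only at the object label."""
    hits = sum(1 for verb, obj in pairs if shortcut.get(obj) == verb)
    return hits / len(pairs)


# Toy data (hypothetical): "drawer" only appears with "close" in training.
train = [("close", "drawer")] * 8 + [("open", "box")] * 8 + [("close", "box")] * 2
seen_test = [("close", "drawer"), ("open", "box")]
unseen_test = [("open", "drawer")]  # verb and object are seen, the pair is not

shortcut = build_shortcut(train)
seen_acc = shortcut_accuracy(shortcut, seen_test)
unseen_acc = shortcut_accuracy(shortcut, unseen_test)
print(f"seen: {seen_acc:.2f}, unseen: {unseen_acc:.2f}")
```

On this toy data the object-only predictor is correct on every seen composition and wrong on the unseen "open drawer" pair, which is the failure pattern the summary attributes to co-occurrence bias.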

📝 Abstract
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, it gains no benefit from compositional recognition on unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
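
One way to picture a penalty on verb predictions that ignore temporal order is a hinge on the gap between the verb probability for the ordered clip and for a frame-shuffled copy of it. The abstract does not give RCORE's actual loss, so the following is only an assumed illustration: `temporal_order_penalty`, the `margin` value, and the use of frame shuffling are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_order_penalty(logits_ordered, logits_shuffled, verb_index, margin=0.2):
    """Hinge penalty (assumed form, not the paper's loss).

    The penalty is zero when the true verb is at least `margin` more probable
    on the correctly ordered clip than on the frame-shuffled clip, and grows
    as the two predictions become indistinguishable.
    """
    p_ordered = softmax(logits_ordered)[verb_index]
    p_shuffled = softmax(logits_shuffled)[verb_index]
    return max(0.0, margin - (p_ordered - p_shuffled))


# A model that ignores frame order gives the same verb logits either way.
order_blind = temporal_order_penalty([2.0, 0.0], [2.0, 0.0], verb_index=0)

# A model that depends on frame order loses confidence on the shuffled clip.
order_aware = temporal_order_penalty([4.0, 0.0], [0.0, 0.0], verb_index=0)
```

Under these assumptions the order-blind model pays the full margin, while the order-aware model pays nothing, so a term of this kind would push verb predictions to depend on temporal structure and not on the object alone.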

Problem

Research questions and friction points this paper is trying to address.

zero-shot compositional action recognition
object-driven shortcuts
compositional video understanding
verb-object composition
co-occurrence bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

object-driven shortcuts
zero-shot compositional action recognition
temporal order regularization
composition-aware augmentation
compositional video understanding