🤖 AI Summary
This work addresses zero-shot compositional action recognition, where models often fail to generalize to unseen verb-object combinations because they rely on object-driven verb shortcuts. The authors show that this limitation stems from existing methods overfitting to co-occurrence statistics in the training data while neglecting temporal visual evidence. To mitigate it, they propose the RCORE framework, which combines composition-aware data augmentation with a temporal order regularization loss, encouraging the model to learn verb semantics from temporally aligned visual context and reducing its dependence on co-occurrence bias. Evaluated on the Sth-com benchmark and a newly constructed EK100-com dataset, RCORE significantly improves recognition accuracy on unseen compositions and consistently yields positive compositional generalization gains.
📝 Abstract
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: the severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, existing ZS-CAR models increasingly ignore visual evidence and overfit to co-occurrence statistics; consequently, they gain no benefit from composition on unseen verb-object combinations. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behavior by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves accuracy on unseen compositions, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
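The abstract does not specify the form of the temporal order regularization loss. As a rough illustration of the general idea behind such losses (an ordered clip should score higher than a shuffled copy of itself, otherwise the model is likely ignoring temporal structure), here is a minimal NumPy sketch. The function names, the directional-consistency score, and the hinge formulation are all assumptions for exposition, not the authors' actual method.

```python
import numpy as np

def order_score(frames):
    """Directional consistency of a clip: mean cosine similarity between
    consecutive frame-to-frame feature deltas. Near 1 for smooth forward
    motion, lower when the frame order is scrambled.

    frames: (T, D) array of per-frame features (toy stand-in; the real
    model would use learned video features)."""
    deltas = np.diff(frames, axis=0)
    deltas = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(deltas[1:] * deltas[:-1], axis=1)))

def temporal_order_reg(frames, margin=0.5, seed=0):
    """Hinge-style regularizer: penalize the model unless the ordered clip
    out-scores a shuffled copy of itself by `margin`. A model that relies
    on object appearance alone scores both versions alike and pays the
    full penalty."""
    rng = np.random.default_rng(seed)
    shuffled = frames[rng.permutation(len(frames))]
    return max(0.0, margin - (order_score(frames) - order_score(shuffled)))
```

For example, on a toy clip whose features drift monotonically over time, `order_score` is close to 1, while manually interleaving the frames (e.g. index order `[0, 2, 1, 4, 3, ...]`) makes consecutive deltas alternate direction and drives the score down, so the hinge activates only when order and shuffle become indistinguishable.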