AI Summary
In open-vocabulary action recognition (OVAR), multimodal large language models (MLLMs) suffer from text-prior bias, which hinders discrimination among semantically similar actions. To address this, we propose Video-STAR, a novel framework for robust, fine-grained action understanding. Its core contributions are threefold: (1) explicit decomposition of actions into discriminative sub-action sequences for fine-grained modeling; (2) integration of dynamically invocable domain-specific tools, coupled with tool-augmented reinforcement learning and a hierarchical reward mechanism, enabling unsupervised tool selection and structured reasoning; and (3) a cross-modal interleaved modeling module that strengthens vision-language alignment and the robustness of grounded reasoning. Video-STAR achieves state-of-the-art performance on HMDB-51, UCF-101, Something-Something v2, and Kinetics-400/600, significantly improving fine-grained recognition accuracy and resilience against cross-modal hallucination.
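To make contribution (1) concrete, below is a minimal sketch of how a decomposed action could be matched against per-sub-motion visual evidence. The `ActionTemplate` structure, the example labels, and the averaging score are illustrative assumptions, not the paper's actual matching procedure; the point is only that actions sharing some sub-motions (e.g. the run-up) are separated by the ones that differ.

```python
# Illustrative sketch of sub-action decomposition for fine-grained matching.
# Template format, labels, and scoring are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class ActionTemplate:
    label: str
    sub_motions: list[str]  # ordered, discriminative sub-action descriptions

TEMPLATES = [
    ActionTemplate("high jump", ["run-up", "takeoff on one foot", "arch over bar"]),
    ActionTemplate("long jump", ["run-up", "takeoff on one foot", "land in sand pit"]),
]

def score_action(template: ActionTemplate,
                 sub_motion_evidence: dict[str, float]) -> float:
    """Average per-sub-motion evidence scores (each in [0, 1]).

    Two actions that share early sub-motions are still told apart by
    the sub-motions where their evidence diverges."""
    return sum(sub_motion_evidence.get(s, 0.0)
               for s in template.sub_motions) / len(template.sub_motions)
```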
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence of reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, Something-Something v2, Kinetics-400, and Kinetics-600 demonstrate state-of-the-art performance: our method outperforms existing approaches in distinguishing fine-grained actions and mitigating cross-modal hallucination, validating its robustness and generalization.
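As a rough illustration of the hierarchical reward, the sketch below combines the three terms named in the abstract into a single scalar for RL training. The weighted-sum form, the weight values, and all function and field names are assumptions made for illustration; the paper's actual reward design may differ.

```python
# Minimal sketch of a hierarchical reward balancing tool-usage efficiency,
# sub-motion relevance, and structural coherence. All names, weights, and
# the linear combination are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RewardWeights:
    tool_efficiency: float = 0.3      # penalizes redundant tool calls
    submotion_relevance: float = 0.5  # rewards sub-motion/evidence alignment
    structural_coherence: float = 0.2 # rewards well-formed reasoning traces

def hierarchical_reward(n_tool_calls: int,
                        n_useful_calls: int,
                        relevance_score: float,    # in [0, 1]
                        coherence_score: float,    # in [0, 1]
                        w: RewardWeights | None = None) -> float:
    """Combine the three reward terms into one scalar for the policy update."""
    w = w or RewardWeights()
    # Fraction of tool invocations that actually contributed evidence.
    efficiency = n_useful_calls / max(n_tool_calls, 1)
    return (w.tool_efficiency * efficiency
            + w.submotion_relevance * relevance_score
            + w.structural_coherence * coherence_score)
```

Under this reading, a rollout that calls few tools but grounds each sub-motion well scores higher than one that invokes many tools with weak visual support, which is consistent with the abstract's claim that tool selection emerges without explicit supervision.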