Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

📅 2025-11-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses open-vocabulary human action segmentation in a zero-shot, unsupervised setting, i.e., without labeled data or predefined action categories. The authors propose ZOMG, a zero-shot open-vocabulary action localization framework that uses large language models (LLMs) to generate semantically ordered sub-action descriptions and applies test-time, instance-aware soft temporal masking to jointly model intra-segment continuity and inter-segment separability, all without fine-tuning the pretrained encoders. Its core innovation is the end-to-end coupling of linguistic semantic decomposition with learnable temporal masks, enabling unsupervised, zero-shot action understanding in realistic scenarios. Across three motion-language benchmarks, ZOMG outperforms state-of-the-art methods, including a +8.7% mean Average Precision (mAP) gain on HumanML3D, and generalizes well to downstream cross-modal retrieval.

📝 Abstract
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks that focus on the frames critical to each sub-action while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. ZOMG also brings significant improvements to downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
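The first component, language semantic partition, prompts an LLM to split an instruction into temporally ordered sub-action phrases. A minimal sketch of what such a prompt-and-parse step might look like follows; the prompt wording, the numbered-list output format, and the function names are illustrative assumptions, not the paper's actual prompt or code:

```python
import re

def build_partition_prompt(instruction: str, max_steps: int = 6) -> str:
    """Hypothetical prompt asking an LLM to split one motion instruction
    into temporally ordered sub-actions as a numbered list."""
    return (
        f"Decompose the motion description below into at most {max_steps} "
        "temporally ordered sub-actions, one per line, as a numbered list.\n"
        f"Description: {instruction}"
    )

def parse_sub_actions(llm_reply: str) -> list[str]:
    """Parse a numbered-list reply ('1. ...') into ordered sub-action strings."""
    steps = []
    for line in llm_reply.splitlines():
        m = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if m:
            steps.append(m.group(1).strip())
    return steps

# Example with a mocked LLM reply (no model call is made here):
reply = "1. bend the knees\n2. swing the arms back\n3. jump forward\n4. land softly"
print(parse_sub_actions(reply))
# → ['bend the knees', 'swing the arms back', 'jump forward', 'land softly']
```

The ordering of the parsed list is what the later masking stage would rely on, since the paper emphasizes that the sub-action units are semantically ordered.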
Problem

Research questions and friction points this paper is trying to address.

Segmenting human motion into semantic sub-actions without annotations
Overcoming limitations of predefined action classes in open-vocabulary settings
Enabling zero-shot motion grounding without fine-tuning using language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large language models for semantic motion decomposition
Uses soft masking optimization for temporal frame focusing
Maintains pretrained encoder without fine-tuning or annotations
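The second component, soft masking optimization, can be sketched end-to-end in a few dozen lines. Everything below is an illustrative assumption rather than the paper's actual objective: random features stand in for the frozen motion and text encoders, masks are parameterized as normalized Gaussians (one center and log-width per sub-action), the loss combines text alignment with an overlap penalty (inter-segment separation) and an ordering penalty (respecting the LLM's sub-action order), and finite differences replace autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 60, 16, 3  # toy sizes: frames, feature dim, number of sub-actions

# Stand-ins for frozen encoders: per-frame motion features and one text
# embedding per LLM-generated sub-action (random here, pretrained in ZOMG).
frame_feats = rng.normal(size=(T, D))
text_feats = rng.normal(size=(K, D))
text_unit = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

def soft_masks(centers, widths):
    """One Gaussian soft temporal mask per sub-action, shape (K, T)."""
    t = np.arange(T)
    m = np.exp(-0.5 * ((t[None, :] - centers[:, None]) / widths[:, None]) ** 2)
    return m / (m.sum(axis=1, keepdims=True) + 1e-8)  # normalize over time

def loss(params):
    centers, widths = params[:K], np.exp(params[K:])  # log-width keeps widths > 0
    M = soft_masks(centers, widths)
    pooled = M @ frame_feats                          # (K, D) masked segment features
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8
    align = np.sum(pooled * text_unit)                # segment-to-text alignment
    overlap = sum(np.sum(M[i] * M[j])                 # inter-segment separation
                  for i in range(K) for j in range(i + 1, K))
    order = np.sum(np.maximum(0.0, centers[:-1] - centers[1:]))  # keep sub-action order
    return -align + 5.0 * overlap + order

def num_grad(f, x, eps=1e-4):
    """Central-difference gradient (a stand-in for autodiff)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Test-time gradient descent (with backtracking) on the 2K mask parameters
# only; the "encoders" (the fixed features above) are never updated.
params = np.concatenate([np.linspace(5.0, T - 5.0, K), np.full(K, np.log(5.0))])
initial = loss(params)
for _ in range(40):
    g = num_grad(loss, params)
    step, cur = 0.5, loss(params)
    while step > 1e-6 and loss(params - step * g) >= cur:
        step *= 0.5
    if step > 1e-6:
        params -= step * g
```

Only the mask parameters are optimized per instance at test time, which mirrors the paper's claim of instance-aware masking without fine-tuning the pretrained encoders.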