🤖 AI Summary
This work addresses personalized dance discovery, which has become harder as online dance content grows, by letting users search with natural-language descriptions of choreographic intent. To this end, we propose CustomDancer, a multimodal dance retrieval framework that aligns text with dance through a CLIP-based text encoder, dedicated music and motion encoders, and a music-motion blending module, so that retrieval reasons jointly over linguistic semantics, musical rhythm, and full-body motion dynamics. We also introduce TD-Data, a large-scale open text-dance dataset annotated by professional dance experts, with about 4,000 12-second clips, 14.6 hours of motion, and 22 genres. On TD-Data, CustomDancer reaches a Recall@1 of 10.23%, the best reported result on this dataset, and user preference studies also favor its retrieval results.
📝 Abstract
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic intent, but it remains underexplored because dance requires simultaneous reasoning over linguistic semantics, musical rhythm, and full-body motion dynamics. We introduce TD-Data, a large-scale open dataset for text-dance retrieval, containing about 4,000 12-second dance clips, 14.6 hours of motion, 22 genres, and annotations from professional dance experts. On top of this dataset, we propose CustomDancer, a multimodal retrieval framework that aligns text with dance through a CLIP-based text encoder, music and motion encoders, and a music-motion blending module. CustomDancer achieves state-of-the-art performance on TD-Data, reaching 10.23% Recall@1 and improving retrieval quality in both quantitative benchmarks and user preference studies.
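To make the Recall@1 figure concrete, the following is a minimal sketch of how Recall@K is commonly computed for text-to-dance retrieval. It rests on assumptions not stated in the abstract: each text query has exactly one ground-truth clip, stored at the same index, and candidates are ranked by cosine similarity between embeddings. The function names `cosine` and `recall_at_k` and the toy vectors are hypothetical illustrations, not the CustomDancer evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(text_embs, dance_embs, k=1):
    """Fraction of text queries whose paired clip (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [cosine(query, clip) for clip in dance_embs]
        # Rank clip indices by descending similarity to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (assumption: query i is paired with clip i).
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dances = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print("Recall@1:", recall_at_k(texts, dances, k=1))
print("Recall@2:", recall_at_k(texts, dances, k=2))
```

Under this definition, a Recall@1 of 10.23% would mean that for roughly one query in ten the correct clip is ranked first among all candidates in the evaluation set.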