🤖 AI Summary
Existing methods for speech-driven gesture generation often suffer from limited generalization, neglect of negative sample learning, and fragmented modeling of body parts. This work proposes a novel contrastive flow matching model that, for the first time, integrates contrastive learning into the flow matching framework. By incorporating mismatched audio-text pairs as negative samples, the model steers the velocity field to evolve along semantically consistent trajectories while repelling inconsistent ones. Furthermore, it constructs a unified multimodal latent space encompassing text, audio, and motion, enabling end-to-end cross-modal alignment. Evaluated on the BEAT2 and SHOW datasets, the proposed approach outperforms state-of-the-art methods, with user studies confirming significant improvements in both semantic coherence and overall naturalness of the generated gestures.
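To make the training idea concrete, below is a minimal PyTorch-style sketch of a contrastive flow matching objective in which the velocity field is attracted to the true motion trajectory under the matched audio-text condition and repelled from it under a mismatched one. The names (`v_theta`, `cond_pos`, `cond_neg`, the weight `lam`) and the exact loss form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_flow_matching_loss(v_theta, x0, x1, cond_pos, cond_neg, lam=0.1):
    """Hypothetical sketch of one contrastive flow matching training step.

    v_theta:  velocity network, callable as v_theta(x_t, t, cond)
    x0:       noise sample,                    shape (B, D)
    x1:       ground-truth motion latent,      shape (B, D)
    cond_pos: matched audio-text condition,    shape (B, C)
    cond_neg: mismatched (negative) condition, shape (B, C)
    """
    B = x0.shape[0]
    t = torch.rand(B, 1, device=x0.device)   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1              # linear interpolation path
    target_v = x1 - x0                       # target velocity along that path

    v_pos = v_theta(x_t, t, cond_pos)        # prediction under matched condition
    v_neg = v_theta(x_t, t, cond_neg)        # prediction under mismatched condition

    # Pull the matched prediction toward the true trajectory and push the
    # mismatched prediction away from it (weighted by a small lam).
    loss_pos = F.mse_loss(v_pos, target_v)
    loss_neg = F.mse_loss(v_neg, target_v)
    return loss_pos - lam * loss_neg
```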
📝 Abstract
While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, whose dependence on predefined linguistic rules limits their generalisation capability. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples, without exposure to negative examples, so it learns rhythmic gestures rather than sparser motions such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
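The cosine and contrastive alignment of the composite latent space can likewise be sketched as below, combining a cosine term that pulls matched pairs together with an InfoNCE-style term that contrasts them against in-batch negatives. The function and variable names (`z_text`, `z_audio`, `z_motion`, `tau`) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def alignment_losses(z_text, z_audio, z_motion, tau=0.07):
    """Illustrative cosine + contrastive alignment of text, audio and motion
    embeddings in a shared latent space; all inputs have shape (B, D)."""
    z_text = F.normalize(z_text, dim=-1)
    z_audio = F.normalize(z_audio, dim=-1)
    z_motion = F.normalize(z_motion, dim=-1)

    # Cosine objective: pull each matched pair toward cosine similarity 1.
    cos_loss = ((1 - (z_text * z_motion).sum(-1)).mean()
                + (1 - (z_audio * z_motion).sum(-1)).mean())

    # InfoNCE-style contrastive objective: matched pairs vs. in-batch negatives.
    def info_nce(a, b):
        logits = a @ b.t() / tau                          # (B, B) similarities
        labels = torch.arange(a.size(0), device=a.device) # diagonal = positives
        return F.cross_entropy(logits, labels)

    nce_loss = info_nce(z_text, z_motion) + info_nce(z_audio, z_motion)
    return cos_loss, nce_loss
```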