🤖 AI Summary
This study investigates how semantic enhancement affects the quality of co-speech gesture generation and how humans perceive the result. We examine two frameworks, AQ-GT (baseline) and AQ-GT-a (a variant with explicit semantic enhancement), trained on the SAGA corpus, and conduct a user-centered evaluation along two dimensions: concept recognition and human-likeness. Results show that explicit semantic enhancement does not universally improve performance: AQ-GT conveys concepts more effectively within its training domain, whereas AQ-GT-a is rated as more expressive and helpful, though not more human-like, and generalizes better to novel contexts, particularly in representing shape and size. The core contribution is empirical evidence of a potential trade-off between specialization and generalization, indicating that the benefit of explicit semantic modeling in co-speech gesture generation depends on context.
📝 Abstract
This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically augmented variant AQ-GT-a, evaluating how well they convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness depends heavily on context, indicating a potential trade-off between specialization and generalization.