AI Summary
This study addresses the lack of standardized evaluation in speech-driven 3D gesture generation, where subjective assessments are often confounded by inconsistent virtual character appearances and facial renderings, introducing perceptual bias. For the first time, the authors conduct a controlled user study within a unified framework to systematically evaluate how facial and bodily representations across seven representative rendering styles influence perceived gesture quality. By integrating multi-source gesture data, diverse rendering pipelines, and rigorous statistical analysis, the work reveals significant and systematic visual interference in human judgment of gestures, quantifies the sources of evaluation bias, and provides concrete recommendations for standardizing benchmarking protocols in gesture synthesis and human-computer interaction applications.
Abstract
The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. We address this gap with controlled evaluations of co-speech gestures across motion sources, spanning seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis as well as for deploying virtual humans in human-facing applications.