🤖 AI Summary
This work addresses the representational mismatch in omni-modal large language models (OLLMs) between the discrete nature of semantic reasoning and the dense temporal dynamics of speech-driven 3D facial animation. To resolve this, the authors propose the Ex-Omni framework, which decouples semantic reasoning from temporal generation by using speech units as a temporal scaffold. A unified token-as-query gated fusion (TQGF) mechanism enables controllable and temporally synchronized 3D facial animation synthesis. Coupled with the newly constructed InstructEx dataset, the approach substantially reduces learning complexity and improves multimodal alignment stability, achieving competitive performance among open-source omni-modal large models across multiple evaluation metrics.
📝 Abstract
Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between the discrete, token-level semantic reasoning of LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as a temporal scaffold and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to support augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned generation of speech and facial animation.
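For readers who want a concrete picture of the fusion step, the sketch below shows one plausible reading of a token-as-query gated fusion block: the LLM's semantic tokens act as cross-attention queries over the speech-unit scaffold, and a learned gate controls how much of the retrieved temporal context is injected back into the token stream. The module name, dimensions, and the exact query/key roles are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class TokenAsQueryGatedFusion(nn.Module):
    """Hedged sketch of a token-as-query gated fusion block.

    Assumption: LLM semantic tokens are the queries, speech-unit features
    (the temporal scaffold) are keys/values, and a sigmoid gate controls
    the injected update. The paper's actual design may differ.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, speech_units: torch.Tensor) -> torch.Tensor:
        # tokens:       (B, N, dim) discrete semantic tokens from the LLM (queries)
        # speech_units: (B, T, dim) dense speech-unit scaffold (keys/values)
        attended, _ = self.cross_attn(query=tokens, key=speech_units, value=speech_units)
        g = self.gate(torch.cat([tokens, attended], dim=-1))  # per-token injection gate
        return self.norm(tokens + g * attended)                # gated residual fusion


# Hypothetical usage: fuse 16 semantic tokens with a 100-frame speech-unit sequence.
fusion = TokenAsQueryGatedFusion(dim=768)
fused = fusion(torch.randn(2, 16, 768), torch.randn(2, 100, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```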