🤖 AI Summary
Current general-purpose multimodal models perform poorly on human-centric video understanding, such as the joint modeling of actions, facial expressions, emotions, and contextual environments, primarily because high-quality human-centric data are scarce and dedicated architectural designs are lacking. To address this, we propose the first large-scale vision–speech–language multimodal model explicitly tailored to human-centric scenarios. Our approach introduces a human-centric multi-branch encoder and an instruction-driven adaptive cross-modal fusion mechanism. We further construct a high-quality human-centric multimodal dataset comprising 2.4 million video clips and 14 million instruction-following annotations, and train the model with multimodal pretraining and cross-modal alignment. Extensive evaluations demonstrate substantial improvements over state-of-the-art general-purpose models on emotion recognition, facial expression description, and action understanding. The model will be open-sourced to advance research in human–computer interaction and embodied intelligence.
📝 Abstract
In human-centric scenes, the ability to understand visual and auditory information simultaneously is crucial. Although recent omni models can process multiple modalities, they are generally less effective in human-centric scenes because of the absence of large-scale specialized datasets and because their architectures are not tailored to such scenes. In this work, we develop HumanOmni, the industry's first human-centric omni-multimodal large language model. We construct a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions, facilitating the understanding of diverse human-centric scenes. HumanOmni includes three specialized branches for understanding different types of scenes and adaptively fuses their features based on user instructions, significantly enhancing visual understanding in scenes centered on individuals. Moreover, HumanOmni integrates audio features to ensure a comprehensive understanding of both environments and individuals. Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks, including emotion recognition, facial expression description, and action understanding. Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
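To make the instruction-driven fusion concrete, below is a minimal PyTorch sketch of one plausible realization: a pooled instruction embedding is mapped to per-branch weights, which gate a softmax-weighted combination of the three branch features. All names and shapes here (`InstructionGatedFusion`, the face/body/environment branch split, the dimensions) are illustrative assumptions, not the released HumanOmni code.

```python
import torch
import torch.nn as nn

class InstructionGatedFusion(nn.Module):
    """Fuses features from several scene-specific visual branches using
    weights predicted from the user's instruction embedding.

    Branch roles (hypothetical split, mirroring the paper's description):
    face-related, body/action-related, and environment/interaction features.
    """

    def __init__(self, feat_dim: int, instr_dim: int, num_branches: int = 3):
        super().__init__()
        # Small MLP mapping the pooled instruction embedding to one
        # scalar logit per branch.
        self.gate = nn.Sequential(
            nn.Linear(instr_dim, instr_dim // 2),
            nn.GELU(),
            nn.Linear(instr_dim // 2, num_branches),
        )

    def forward(self, branch_feats: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # branch_feats: (batch, num_branches, tokens, feat_dim)
        # instr_emb:    (batch, instr_dim) pooled instruction embedding
        weights = torch.softmax(self.gate(instr_emb), dim=-1)  # (batch, num_branches)
        # Broadcast the per-branch weights over tokens and channels,
        # then sum the branches into a single visual feature sequence.
        fused = (branch_feats * weights[:, :, None, None]).sum(dim=1)
        return fused  # (batch, tokens, feat_dim)

# Toy usage: three branches, 8 visual tokens of dim 1024, instruction dim 768.
fusion = InstructionGatedFusion(feat_dim=1024, instr_dim=768)
feats = torch.randn(2, 3, 8, 1024)
instr = torch.randn(2, 768)
print(fusion(feats, instr).shape)  # torch.Size([2, 8, 1024])
```

A soft (softmax) gate like this lets an instruction such as "describe the facial expression" up-weight the face branch while still passing some body and environment context; a hard branch selector would be a simpler but less flexible alternative.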