AI Summary
Existing datasets exhibit structural limitations in scene and camera diversity, modeling of human-human and human-object interactions, and alignment of individual-level attributes, hindering the development of high-fidelity human-centric video generation. To address these gaps, this work introduces OmniHuman, a large-scale, multi-scene dataset featuring hierarchical annotations spanning video-level scenes, frame-level interactions, and individual-level attributes, accompanied by a fully automated, high-quality data collection and multimodal annotation pipeline. Furthermore, we propose OHBench, the first three-tier evaluation framework tailored for human-centric video generation, which incorporates novel metrics highly aligned with human perception to comprehensively assess global scene coherence, relational interactions, and individual attribute fidelity. Experiments demonstrate that OmniHuman substantially enhances the scientific rigor and effectiveness of evaluating generative models in this domain.
Abstract
Recent advances in joint audio-video generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We trace the root cause to structural deficiencies of existing datasets along three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides hierarchical annotations covering video-level scenes, frame-level interactions, and individual-level attributes. To construct it, we develop a fully automated pipeline for high-quality data collection and multimodal annotation. Complementing the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system for diagnosing human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling gaps in existing benchmarks with a comprehensive assessment across global scenes, relational interactions, and individual attributes.