🤖 AI Summary
Existing cross-domain 3D human motion models rely on domain-specific components and multi-stage training, limiting generalization and scalability. Method: We propose Human-in-Context (HiC), the first unified framework enabling joint multimodal (pose/mesh), multi-task, and multi-dataset modeling within a single end-to-end pipeline. Contributions/Results: HiC introduces (1) a max-min similarity prompting sampling strategy to enhance contextual awareness; (2) a dual-branch context injection architecture that disentangles and fuses cross-domain semantic information; and (3) a context-aware unified network, eliminating domain-customized modules. Evaluated across multiple benchmarks, HiC consistently outperforms state-of-the-art methods in cross-domain generalization, large-scale data adaptation, and zero-shot transfer—achieving new SOTA performance. The framework establishes a highly flexible, scalable, and general-purpose paradigm for 3D human motion modeling.
📝 Abstract
This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.