Hulk: A Universal Knowledge Translator for Human-Centric Tasks

📅 2023-12-04
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 14
✨ Influential: 1
🤖 AI Summary
Existing human-centric perception models suffer from modality fragmentation and task-specific fine-tuning, limiting generalization across diverse downstream tasks. This paper introduces the first multimodal foundation model for human-centric tasks, unifying 2D/3D vision, skeletal action modeling, and vision-language understanding without task-specific adaptation. Our approach features three key innovations: (1) a novel dual universal head architecture that jointly models discrete representations (e.g., pose labels) and continuous ones (e.g., 3D coordinates); (2) a "cross-modal knowledge translation" paradigm that bridges semantic gaps via modality-aligned embeddings and a unified sequence-based representation; and (3) multi-task joint pretraining coupled with prompt-driven inference. Evaluated on 12 benchmarks spanning eight task categories, our model achieves state-of-the-art performance on 11 tasks, demonstrating substantial improvements in cross-task generalization and transferability.
๐Ÿ“ Abstract
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance on 11 benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in human-centric foundation models for 3D and vision-language tasks
Eliminating task-specific finetuning for diverse human-centric perception tasks
Integrating knowledge across 2D, 3D, skeleton-based, and vision-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal generalist model for diverse tasks
Two general heads for discrete and continuous representations
Uniform representation enabling knowledge integration
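The core idea above, condensing all task-specific heads into one head for discrete representations and one for continuous representations, can be sketched roughly as follows. This is a minimal illustration only; the class names, task names, and routing logic are assumptions for exposition, not the paper's actual implementation:

```python
# Hypothetical sketch of the "two general heads" idea: every task output
# is routed either to a discrete head (token/label sequences, e.g. language)
# or to a continuous head (real-valued outputs, e.g. location coordinates),
# so a single generalist model can serve all tasks as modality translation.
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class DiscreteOutput:
    tokens: List[str]  # e.g. words or pose/attribute labels


@dataclass
class ContinuousOutput:
    values: List[float]  # e.g. 2D/3D coordinates


# Assumed task-to-head mapping, for illustration only.
DISCRETE_TASKS = {"captioning", "attribute_recognition", "action_recognition"}


def route_output(task: str, raw: Sequence) -> Union[DiscreteOutput, ContinuousOutput]:
    """Route a task's raw prediction to the matching general head."""
    if task in DISCRETE_TASKS:
        return DiscreteOutput(tokens=[str(t) for t in raw])
    return ContinuousOutput(values=[float(v) for v in raw])


# Pose estimation yields continuous coordinates; captioning yields tokens.
pose = route_output("pose_estimation", [0.31, 0.72, 0.15])
caption = route_output("captioning", ["a", "person", "running"])
```

The point of the sketch is only the routing: with every task's output expressed as either a discrete sequence or a continuous one, no task-specific head (and hence no task-specific finetuning) is needed.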
Yizhou Wang
Shanghai AI Laboratory, Shanghai, 200232, China
Yixuan Wu
Postdoc fellow @ JHU, special volunteer @ NIH
Weizhen He
College of Electrical Engineering, Zhejiang University, Hangzhou, 310027, China
Xun Guo
Department of Automation, University of Science and Technology of China, Hefei, 230052, China
Feng Zhu
SenseTime Group Limited, China
Lei Bai
Shanghai AI Laboratory
Rui Zhao
SenseTime Group Limited, China
Jian Wu
School of Public Health, Zhejiang University, Hangzhou, 310058, China
Tong He
Shanghai AI Laboratory, Shanghai, 200232, China
Wanli Ouyang
Shanghai AI Laboratory, Shanghai, 200232, China
Shixiang Tang
School of Electrical and Information Engineering, University of Sydney, NSW, Australia