Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models

📅 2024-09-14
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language models exhibit limited performance on human pose and action understanding, primarily due to the absence of structured, human-centric instruction data. To address this, we propose a novel multimodal instruction data synthesis method that jointly incorporates human keypoints, bounding boxes, and image captions. This is the first approach to deeply embed keypoint-structured semantic representations into the instruction-tuning pipeline, supporting three task categories: conversational interaction, fine-grained description, and complex spatial reasoning. Built upon the LLaVA-7B architecture, our framework integrates pose estimation, cross-modal alignment, and controllable multimodal data generation. The fine-tuned model improves by 21.18% over the baseline on human pose–related benchmarks, with markedly stronger semantic understanding, spatio-temporal reasoning, and generative fidelity in human action interpretation.
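
The summary describes a pipeline that serializes each person's keypoints and bounding box into text alongside the image caption, then prompts a language model to synthesize instruction data. Below is a minimal sketch of that serialization step, assuming COCO-format annotations (17 named keypoints per person, each an (x, y, visibility) triple); the function names and prompt wording are illustrative, not the authors' exact implementation.

```python
# Minimal sketch of the keypoint-to-text serialization described above,
# assuming COCO-format annotations. Helper names and prompt wording are
# hypothetical, not taken from the paper.

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def serialize_person(bbox, keypoints):
    """Render one person's bounding box and visible keypoints as plain text."""
    x, y, w, h = bbox
    lines = [f"bounding box: ({x:.1f}, {y:.1f}, {w:.1f}, {h:.1f})"]
    for name, (kx, ky, vis) in zip(COCO_KEYPOINT_NAMES, keypoints):
        if vis > 0:  # skip keypoints not labeled in the annotation
            lines.append(f"{name}: ({kx:.1f}, {ky:.1f})")
    return "\n".join(lines)

def build_generation_prompt(caption, people, task):
    """Assemble the text-only prompt given to a language model to
    synthesize one instruction-following sample of the requested type."""
    assert task in ("conversation", "detailed description", "complex reasoning")
    persons = "\n\n".join(
        f"Person {i + 1}:\n{serialize_person(p['bbox'], p['keypoints'])}"
        for i, p in enumerate(people)
    )
    return (
        f"Image caption: {caption}\n\n{persons}\n\n"
        f"Using the caption, bounding boxes, and body keypoints above, "
        f"generate a {task} sample about the people's poses and actions."
    )

# Example call with a single annotated person:
prompt = build_generation_prompt(
    caption="A man waves while standing next to a parked bicycle.",
    people=[{"bbox": (12.0, 40.0, 150.0, 320.0),
             "keypoints": [(80.0, 60.0, 2)] * 17}],
    task="conversation",
)
```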

๐Ÿ“ Abstract
Current multimodal models are well-suited for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions, primarily due to the lack of specialized instruction-following data. We introduce a new method for generating such data by integrating human keypoints with traditional visual features like captions and bounding boxes. Our approach produces datasets designed for fine-tuning models to excel in human-centric activities, focusing on three specific types: conversation, detailed description, and complex reasoning. We fine-tuned the LLaVA-7B model with this novel dataset, achieving significant improvements across various human pose-related tasks. Experimental results show an overall improvement of 21.18% compared to the original LLaVA-7B model. These findings demonstrate the effectiveness of keypoints-assisted data in enhancing multimodal models.
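
For context, one synthesized training sample might look like the following sketch, which reuses the conversation JSON layout of the released LLaVA instruction data ("id", "image", "conversations" with "from": "human"/"gpt"); the image id and the question/answer text here are invented for illustration.

```python
# Hedged sketch of one synthesized sample in LLaVA's conversation format.
# The id, filename, and dialogue content are hypothetical.
sample = {
    "id": "000000123456",            # hypothetical COCO image id
    "image": "000000123456.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat is the person on the left doing with their arms?"},
        {"from": "gpt",
         "value": "The left wrist is raised above the shoulder while the right "
                  "arm hangs near the hip, consistent with waving at someone."},
    ],
}
```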
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized instruction-following data for human pose and action tasks
Need for precise understanding of human-centric scenes
Underperformance of general multimodal models on human-centric tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates human keypoints with traditional visual features (captions, bounding boxes)
Generates specialized vision-language instruction-following data
Fine-tunes LLaVA-7B for human pose understanding
Dewen Zhang
The University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, Tokyo, Japan
Wangpeng An
TikTok Inc., 1199 Coleman Ave, San Jose, CA 95110
Hayaru Shouno
The University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, Tokyo, Japan