AI Summary
Existing vision-language models exhibit limited performance on human pose and action understanding, primarily due to the absence of structured, human-centric instruction data. To address this, we propose a novel multimodal instruction data synthesis method that jointly incorporates human keypoints, bounding boxes, and image captions. This is the first approach to deeply embed keypoint-structured semantic representations into the instruction-tuning pipeline, enabling support for three task categories: conversational interaction, fine-grained description, and complex spatial reasoning. Built upon the LLaVA-7B architecture, our framework integrates pose estimation, cross-modal alignment, and controllable multimodal data generation. Experimental results demonstrate a 21.18% improvement over baseline models on human pose-related benchmarks, significantly enhancing the model's capacity for semantic understanding, spatial-temporal reasoning, and generative fidelity in human action interpretation.
Abstract
Current multimodal models are well-suited for general visual understanding tasks. However, they perform inadequately on complex visual tasks involving human poses and actions, primarily due to the lack of specialized instruction-following data. We introduce a new method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes. Our approach produces datasets designed for fine-tuning models to excel at human-centric activities, covering three task types: conversation, detailed description, and complex reasoning. We fine-tuned the LLaVA-7B model on this dataset, achieving significant improvements across a range of human pose-related tasks. Experimental results show an overall improvement of 21.18% over the original LLaVA-7B model. These findings demonstrate the effectiveness of keypoint-assisted data in enhancing multimodal models.
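To make the data-synthesis idea concrete, the sketch below shows one plausible way to serialize a caption, person bounding boxes, and keypoints into a textual context that a language model could then turn into instruction-following samples. This is a minimal illustration assuming COCO-style annotations; the function name, field layout, and example values are hypothetical, not the paper's actual pipeline.

```python
# Hypothetical sketch: pack COCO-style human annotations (caption, bounding
# boxes, named keypoints with visibility flags) into a single text context
# that an LLM prompt for instruction-data generation could consume.

def build_context(caption, boxes, keypoints):
    """Serialize a caption plus per-person boxes and keypoints as text.

    boxes: list of (x, y, w, h) tuples, one per person.
    keypoints: list of dicts mapping keypoint name -> (px, py, visibility).
    """
    lines = [f"Caption: {caption}"]
    for i, (box, kps) in enumerate(zip(boxes, keypoints)):
        x, y, w, h = box
        lines.append(f"Person {i}: bbox=({x:.2f}, {y:.2f}, {w:.2f}, {h:.2f})")
        kp_str = ", ".join(
            f"{name}=({px:.2f}, {py:.2f}, v={v})"
            for name, (px, py, v) in kps.items()
        )
        lines.append(f"  keypoints: {kp_str}")
    return "\n".join(lines)

# Example with one annotated person (values are illustrative only).
ctx = build_context(
    caption="A man swings a tennis racket on a court.",
    boxes=[(120.0, 40.0, 180.0, 320.0)],
    keypoints=[{"right_wrist": (260.0, 95.0, 2),
                "right_elbow": (230.0, 140.0, 2)}],
)
print(ctx)
```

The resulting context block would be prepended to task-specific prompts (conversation, detailed description, or complex reasoning) when asking a language model to synthesize instruction data.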