ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification

πŸ“… 2025-04-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the significant cross-modal appearance discrepancies and the insufficient modeling of high-level semantic and body-shape features in visible-infrared person re-identification (VIReID), this paper introduces human shape priors for the first time and proposes the Body Shape-aware Textual Alignment (BSaTa) framework. BSaTa comprises a Body Shape Textual Alignment (BSTA) module that extracts body shape information with a human parsing model and converts it into structured textual representations via CLIP; a Text-Visual Consistency Regularizer (TVCR) that aligns the body-shape text representations with visual body-shape features; and a Shape-aware Representation Learning (SRL) mechanism that combines multi-text supervision with distribution consistency constraints to guide the visual encoder toward modality-invariant, discriminative identity features. Extensive experiments demonstrate state-of-the-art performance on SYSU-MM01 and RegDB, with substantial gains in cross-modal matching accuracy.

πŸ“ Abstract
Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
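The text-visual alignment described in the abstract (pulling body-shape textual representations toward the corresponding visual features) is typically realized as a CLIP-style symmetric contrastive objective. The sketch below is a hypothetical illustration of that general idea, not the paper's actual TVCR implementation; the function and parameter names are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_loss(text_emb, vis_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: each visual feature is pulled toward
    its paired body-shape text embedding and pushed from the others.
    (Illustrative sketch only; not the paper's TVCR.)"""
    t = l2_normalize(text_emb)
    v = l2_normalize(vis_emb)
    logits = (v @ t.T) / temperature          # (N, N) cosine similarities
    labels = np.arange(len(logits))           # matching pairs on the diagonal

    def xent(lg):
        # Row-wise softmax cross-entropy with diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average over both directions: visual-to-text and text-to-visual.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With correctly paired embeddings the loss approaches zero; permuting the pairing raises it, which is what drives the encoders toward a shared, modality-invariant space.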
Problem

Research questions and friction points this paper is trying to address.

Address modality differences in Visible-Infrared Person Re-identification
Enhance body shape feature modeling for cross-modal matching
Improve semantic alignment between textual and visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Body Shape-aware Textual Alignment (BSaTa) framework
Body Shape Textual Alignment (BSTA) module
Shape-aware Representation Learning (SRL) mechanism
πŸ”Ž Similar Papers
No similar papers found.
Shuanglin Yan
Nanjing University of Science and Technology
Neng Dong
Nanjing University of Science and Technology
Shuang Li
Chongqing University of Posts and Telecommunications
Rui Yan
Nanjing University of Science and Technology
Hao Tang
The Hong Kong Polytechnic University
Jing Qin
University of Southern Denmark
Mathematics, Statistics