🤖 AI Summary
To address insufficient multimodal fusion and ineffective language prior guidance for visual localization in occluded human pose estimation, this paper proposes PGVL—a parsing-graph-based vision-language interaction framework. PGVL constructs modality-specific parsing graphs and introduces a Guidance Module (GM) that hierarchically transfers high-level linguistic spatial relationship priors to low-level visual features. Furthermore, it incorporates recursive bidirectional cross-attention to establish a dual-path interaction mechanism—top-down decomposition and bottom-up composition—thereby significantly enhancing responsiveness to and precise localization of occluded body parts. Extensive experiments on mainstream benchmarks demonstrate that PGVL substantially outperforms existing methods under occlusion, particularly achieving superior robustness and localization accuracy in complex, highly occluded environments.
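The dual-path interaction described above (top-down decomposition of global context into parts, then bottom-up composition back to the whole) can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the three-level hierarchy, the node names, the noisy-copy "decomposition", and mean-pooling "composition" are all placeholder assumptions standing in for the learned operators.

```python
import numpy as np

# Hypothetical 3-level parse graph: body -> half-bodies -> joints (illustrative only)
PARSE_GRAPH = {
    "body": ["upper", "lower"],
    "upper": ["shoulder", "elbow", "wrist"],
    "lower": ["hip", "knee", "ankle"],
}

def decompose(node, feat, graph, rng):
    """Top-down: split a parent feature into child features.
    (A perturbed copy stands in for learned per-child projections.)"""
    return {c: feat + rng.normal(scale=0.1, size=feat.shape)
            for c in graph.get(node, [])}

def compose(child_feats):
    """Bottom-up: aggregate child features back into the parent.
    (Mean pooling stands in for a learned composition.)"""
    return np.mean(list(child_feats.values()), axis=0)

rng = np.random.default_rng(0)
root = rng.normal(size=(8,))                      # global (body-level) feature

# Top-down decomposition along the parse graph
limbs = decompose("body", root, PARSE_GRAPH, rng)
joints = {c: decompose(c, f, PARSE_GRAPH, rng) for c, f in limbs.items()}

# Bottom-up composition mirrors the decomposition in reverse
limbs_up = {c: compose(j) for c, j in joints.items()}
root_up = compose(limbs_up)
print(root_up.shape)  # (8,)
```

In the actual framework each modality (vision, language) would maintain its own such graph, with cross-attention exchanging information between them at every level.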
📝 Abstract
Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly models a single modality and ignores the potential of multimodal fusion. Notably, language offers rich HPE priors, such as spatial relations, for occluded scenes, but existing visual-language fusion based on global feature integration weakens responses in occluded regions and causes alignment and localization failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL), whose core is a novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximally preserving responses in occluded areas, while high-level nodes integrate global features to infer occluded or invisible parts. GM enables high-semantic nodes to guide the feature updates of low-semantic nodes that have undergone cross-attention, ensuring effective fusion of diverse information. PGVL comprises top-down decomposition and bottom-up composition: in the first stage, modality-specific parse graphs are constructed; in the second, recursive bidirectional cross-attention is applied and purified by GM. We also design a network based on PGVL. Both PGVL and our network are validated on major pose estimation datasets. We will release the code soon.
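A rough sketch of the GM-purified cross-attention step described in the abstract: low-level nodes of one modality attend to the other modality, and a high-semantic parent node then gates the fused update. Everything here is an assumption for illustration; in particular, the single-head dot-product attention and the sigmoid gate are stand-ins for whatever learned guidance the paper actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d):
    """Single-head dot-product cross-attention (sketch).
    q_feats: (Nq, d) queries from one modality; kv_feats: (Nk, d) from the other."""
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d), axis=-1)
    return attn @ kv_feats

def guided_update(low_feats, high_feat, other_modality_feats, d):
    """Hypothetical GM step: the high-level parent node gates the
    cross-attended update of its low-level children."""
    # 1) Low-level nodes attend across modalities (e.g. vision -> language)
    fused = cross_attention(low_feats, other_modality_feats, d)
    # 2) Guidance (assumed form): a sigmoid gate from the similarity between
    #    the fused local features and the global high-level node, so global
    #    context purifies the local updates
    gate = 1.0 / (1.0 + np.exp(-(fused * high_feat).sum(-1, keepdims=True)))
    return low_feats + gate * fused

rng = np.random.default_rng(0)
d = 8
low = rng.normal(size=(4, d))    # low-level (part) nodes: local visual features
high = rng.normal(size=(1, d))   # high-level (body) node: global context
lang = rng.normal(size=(6, d))   # token features from the language branch
out = guided_update(low, high, lang, d)
print(out.shape)  # (4, 8)
```

Running this recursively in both directions over the parse graph, with the roles of the two modalities swapped at each pass, would correspond to the bidirectional interaction the abstract describes.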