Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing skeleton-based temporal action segmentation methods neglect the semantic correlations between joints and actions, limiting motion understanding. To address this, we propose a text-derived relational graph-guided dual-path enhancement framework. First, Dynamic Spatio-Temporal Fusion Modeling (DSFM) jointly captures spatio-temporal dependencies using joint and action graphs generated by large language models. Second, Absolute-Relative Inter-Class Supervision (ARIS) integrates contrastive learning with textual embeddings to enable fine-grained, frame-level supervision. Additionally, Spatial-Aware Enhancement Processing (SAEP) improves generalization via geometric augmentations—including random joint masking and axial rotation—guided by spatial priors. Our method achieves state-of-the-art performance on four standard benchmarks, notably improving action-boundary localization and fine-grained action discrimination.
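The "absolute" half of ARIS—pulling each frame's action feature toward the text embedding of its ground-truth class—can be illustrated with a minimal InfoNCE-style sketch. This is an assumption about the general technique, not the paper's exact loss; the function name, temperature value, and shapes are hypothetical.

```python
import numpy as np

def absolute_contrastive_loss(frame_feats, text_embeds, labels, temperature=0.07):
    """Hypothetical sketch of ARIS's absolute supervision term: align each
    frame's action feature with the text embedding of its ground-truth class,
    pushing it away from the other class embeddings.

    frame_feats: (T, D) per-frame action features
    text_embeds: (C, D) one text embedding per action class
    labels:      (T,)   ground-truth class index per frame
    """
    # cosine similarity via L2-normalized features
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = f @ t.T / temperature                     # (T, C) frame-to-class similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy against the ground-truth class text embedding
    return -log_probs[np.arange(len(labels)), labels].mean()
```

The "relative" half, per the abstract, additionally regularizes the pairwise relationships among class features using the Text-Derived Action Graph (TAG); that term is omitted here.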

📝 Abstract
Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.
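The SAEP augmentations named in the abstract—random joint occlusion and axial rotation—can be sketched as follows. This is a minimal illustration of the general idea, not the paper's implementation; the function name, occlusion probability, rotation range, and choice of vertical axis are all assumptions.

```python
import numpy as np

def saep_augment(skeleton, mask_prob=0.1, max_angle=np.pi / 6, rng=None):
    """Hypothetical sketch of SAEP-style spatial augmentation.

    skeleton: (T, J, 3) array of 3D joint coordinates over T frames.
    mask_prob and max_angle are illustrative hyperparameters, not the paper's.
    """
    rng = rng if rng is not None else np.random.default_rng()
    out = skeleton.copy()
    # random joint occlusion: zero out each joint independently, across all frames
    keep = rng.random(out.shape[1]) >= mask_prob
    out[:, ~keep, :] = 0.0
    # axial rotation: rotate the whole sequence about the (assumed) vertical y-axis
    a = rng.uniform(-max_angle, max_angle)
    rot_y = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    return out @ rot_y.T
```

Because the rotation is rigid, per-joint distances from the axis origin are preserved for all unoccluded joints, so the augmentation changes viewpoint without distorting the pose.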
Problem

Research questions and friction points this paper is trying to address.

Enhance skeleton-based action segmentation using text-derived relational graphs.
Improve modeling of spatial and temporal relations in skeletal movements.
Regularize action class distributions with contrastive learning and text embeddings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM-generated graphs for enhanced modeling
Uses contrastive learning with text embeddings
Incorporates spatial-aware enhancement techniques
Haoyu Ji
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China
Bowen Chen
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China
Weihong Ren
Harbin Institute of Technology, Shenzhen
image restoration, multiple object tracking, action detection
Wenze Huang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China
Zhihao Yang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China
Zhiyong Wang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China
Honghai Liu
University of Portsmouth
Human-Machine Systems, Multi-Sensory Data Fusion and Information Analytics, Bio-Mechatronics, Pattern Recognition, Intelligent Robotics