PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets three weaknesses of existing multimodal action quality assessment (AQA) methods: coarse fusion and uniform temporal modeling blur modality-specific cues, retain cross-modal redundancy, and weaken stage-specific quality evidence. The authors propose PIDNet, a progressive implicit decoupling and fusion network. Its iMambaWave module maps RGB, optical flow, and audio features into a shared latent space and disentangles them implicitly with two branches: a Bi-Mamba branch for long-range temporal dependencies and a wavelet-transform branch for local perturbation details, fused adaptively by a gating mechanism. A three-stage progressive fusion network built from Group3M blocks then applies modality complementary attention to retrieve cross-modal evidence while suppressing redundancy, and multi-scale convolutions to enrich the features. On the Rhythmic Gymnastics and Fis-V datasets, PIDNet reports highly competitive score correlation with favorable error control relative to existing unimodal and multimodal methods. Ablation studies support the contribution of each component, and iMambaWave improves several backbones, indicating good generalization and plug-and-play capability.
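For readers unfamiliar with how such results are scored, the sketch below shows one plausible pair of metrics. It assumes that "score correlation" means Spearman's rank correlation and that "error" means mean squared error; the page names neither, so both choices, and the function names `spearman` and `mse`, are illustrative assumptions rather than the paper's evaluation code.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(pred, target):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(target))


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

A rank correlation rewards getting the ordering of performances right, while an error term penalizes scores that are ordered correctly but numerically far from the judges' values, which is why the two are usually reported together.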

📝 Abstract

Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.
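The two-branch design with gated aggregation can be illustrated on a toy 1-D feature sequence. This is a minimal sketch under stated assumptions, not the authors' iMambaWave: a bidirectional exponential moving average stands in for the Bi-Mamba branch, a one-level Haar transform stands in for the wavelet branch, and the gate uses fixed scalar weights (`w_t`, `w_f`, `bias`) where a trained model would learn them. All function names are hypothetical.

```python
import math


def haar_dwt(signal):
    """One-level Haar transform: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the full-length sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def bidirectional_ema(signal, alpha=0.5):
    """Placeholder for the Bi-Mamba branch (assumption): mean of a forward
    and a backward recurrent scan over the sequence."""
    def scan(xs):
        state, out = xs[0], []
        for x in xs:
            state = alpha * x + (1.0 - alpha) * state
            out.append(state)
        return out

    fwd = scan(signal)
    bwd = scan(signal[::-1])[::-1]
    return [(a + b) / 2.0 for a, b in zip(fwd, bwd)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(temporal, frequency, w_t=1.0, w_f=-1.0, bias=0.0):
    """Per-step gate g in (0, 1) mixes the branches: g * t + (1 - g) * f."""
    fused = []
    for t, f in zip(temporal, frequency):
        g = sigmoid(w_t * t + w_f * f + bias)
        fused.append(g * t + (1.0 - g) * f)
    return fused


# Toy sequence with one sharp local perturbation at index 3.
x = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0]
approx, detail = haar_dwt(x)
# Keep only the detail band so this branch carries local, high-frequency content.
local = haar_idwt([0.0] * len(approx), detail)
long_range = bidirectional_ema(x)
fused = gated_fuse(long_range, local)
```

Because the gate output lies strictly between 0 and 1, each fused value is a convex combination of the two branch outputs at that time step, so the gate decides per step how much to trust smooth context versus local detail.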

Problem

Research questions and friction points this paper is trying to address.

multimodal action quality assessment

heterogeneous modalities

progressive quality cues

cross-modal redundancy

temporal modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive fusion

implicit decoupling

iMambaWave