Multimodal Action Quality Assessment

📅 2024-01-31
🏛️ IEEE Transactions on Image Processing
📈 Citations: 5
Influential: 0

🤖 AI Summary
For sports in which background music accompanies the performance, such as figure skating and rhythmic gymnastics, this paper addresses how to use audio alongside visual information to improve Action Quality Assessment (AQA). It proposes the Progressive Adaptive Multimodal Fusion Network (PAMFN), which jointly models RGB, optical flow, and audio modalities for AQA. PAMFN has three modality-specific branches and a mixed-modality branch, connected by three modules: a Modality-specific Feature Decoder, an Adaptive Fusion Module (several FusionNets plus a PolicyNet), and a Cross-modal Feature Decoder. Together these let the mixed-modality branch progressively aggregate modality-specific information. Unlike an invariant fusion strategy, the Adaptive Fusion Module learns different fusion policies for different parts of an action. Evaluated on two public datasets, PAMFN achieves state-of-the-art performance, which supports the claim that audio is useful complementary information for score regression.

📝 Abstract
Action quality assessment (AQA) aims to assess how well an action is performed. Previous works model AQA using only visual information and ignore audio. We argue that although AQA depends heavily on visual information, audio is useful complementary information for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow, and audio, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the information from the modality-specific branches. To bridge the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that an invariant multimodal fusion policy may lead to suboptimal results because it does not account for the potential diversity across different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies for different parts of an action. This module consists of several FusionNets that explore different multimodal fusion strategies and a PolicyNet that decides which FusionNets are enabled. Third, a Cross-modal Feature Decoder module is designed to transfer the cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
Our extensive experiments validate the efficacy of the proposed method, and our method achieves state-of-the-art performance on two public datasets. Code is available at https://github.com/qinghuannn/PAMFN.
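
The idea behind the Adaptive Fusion Module, several candidate fusion strategies with a policy that enables some of them for each part of an action, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The three fusion functions, the linear policy, the 0.5 threshold, the averaging of enabled outputs, and the fallback are all assumptions chosen for illustration, and the "networks" here are plain functions rather than learned modules.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
FusionFn = Callable[[Vector, Vector, Vector], Vector]


# Hypothetical stand-ins for FusionNets: each combines the RGB, optical-flow,
# and audio features of one action segment in a different way.
def fuse_mean(rgb: Vector, flow: Vector, audio: Vector) -> Vector:
    return [(r + f + a) / 3.0 for r, f, a in zip(rgb, flow, audio)]


def fuse_max(rgb: Vector, flow: Vector, audio: Vector) -> Vector:
    return [max(r, f, a) for r, f, a in zip(rgb, flow, audio)]


def fuse_visual_only(rgb: Vector, flow: Vector, audio: Vector) -> Vector:
    return [(r + f) / 2.0 for r, f in zip(rgb, flow)]


FUSION_NETS: List[FusionFn] = [fuse_mean, fuse_max, fuse_visual_only]


def policy_net(rgb: Vector, flow: Vector, audio: Vector,
               weights: Sequence[Sequence[float]],
               threshold: float = 0.5) -> List[bool]:
    """Hypothetical policy: one linear score per fusion function.

    Returns one on/off decision per entry in FUSION_NETS for this segment.
    """
    x = rgb + flow + audio  # concatenate the three feature vectors
    decisions = []
    for w in weights:
        logit = sum(wi * xi for wi, xi in zip(w, x))
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        decisions.append(prob > threshold)
    return decisions


def adaptive_fuse(rgb: Vector, flow: Vector, audio: Vector,
                  weights: Sequence[Sequence[float]]) -> Vector:
    """Run only the enabled fusion functions and average their outputs."""
    enabled = policy_net(rgb, flow, audio, weights)
    outputs = [net(rgb, flow, audio)
               for net, on in zip(FUSION_NETS, enabled) if on]
    if not outputs:
        # Assumed fallback so that every segment still yields a feature.
        outputs = [fuse_mean(rgb, flow, audio)]
    return [sum(o[i] for o in outputs) / len(outputs)
            for i in range(len(rgb))]
```

Because the policy is evaluated per segment, two segments of the same routine can end up using different subsets of fusion functions, which is the behavior an invariant fusion policy cannot provide.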
Problem

Research questions and friction points this paper is trying to address.

Enhance action quality assessment using multimodal data.
Improve score regression accuracy with audio and visual fusion.
Develop adaptive fusion strategies for diverse action parts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Adaptive Multimodal Fusion Network (PAMFN)
Adaptive Fusion Module for dynamic fusion policies
Cross-modal Feature Decoder for feature transfer