Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses key challenges in multimodal interview assessment—namely, the difficulty of fusing explicit and implicit behavioral cues and severe dimensional bias—by proposing the “365” holistic evaluation framework. It integrates video, audio, and text modalities to model candidate responses across six interview rounds along five core assessment dimensions. Methodologically, we design modality-specific feature extractors, employ a shared compressed multilayer perceptron for cross-modal feature alignment, and adopt a two-level ensemble strategy—comprising independent regression heads followed by mean pooling—to enhance prediction stability. Evaluated on the AVI Challenge 2025 benchmark, our framework achieves first place with a multidimensional average MSE of 0.1824, significantly outperforming all baselines. Results demonstrate superior accuracy, robustness, and fairness, validating the framework’s comprehensive advantages in unbiased multimodal behavioral assessment.

📝 Abstract
Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
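The fusion and ensemble pipeline described in the abstract can be sketched roughly as follows. All embedding sizes, layer widths, and the random weight initialisation are illustrative assumptions for the sketch, not the paper's published configuration; see the linked repository for the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- stand-ins, not the paper's published values.
D_VIDEO, D_AUDIO, D_TEXT = 768, 512, 768
D_LATENT = 256          # shared compressed latent space
N_RESPONSES = 6         # six interview responses per candidate
N_DIMS = 5              # five target assessment dimensions

def shared_compression_mlp(x, w1, b1, w2, b2):
    """Compress a concatenated multimodal embedding into the shared latent space."""
    h = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
    return h @ w2 + b2

# Randomly initialised weights stand in for trained parameters.
d_in = D_VIDEO + D_AUDIO + D_TEXT
w1 = rng.normal(scale=0.02, size=(d_in, 512)); b1 = np.zeros(512)
w2 = rng.normal(scale=0.02, size=(512, D_LATENT)); b2 = np.zeros(D_LATENT)
# First ensemble level: one independent regression head per response.
heads = [rng.normal(scale=0.02, size=(D_LATENT, N_DIMS)) for _ in range(N_RESPONSES)]

def assess(candidate_responses):
    """candidate_responses: list of (video_emb, audio_emb, text_emb), one per response."""
    per_response = []
    for i, (v, a, t) in enumerate(candidate_responses):
        fused = shared_compression_mlp(np.concatenate([v, a, t]), w1, b1, w2, b2)
        per_response.append(fused @ heads[i])       # scores for the 5 dimensions
    # Second ensemble level: mean-pool predictions across the six responses.
    return np.mean(per_response, axis=0)

responses = [(rng.normal(size=D_VIDEO), rng.normal(size=D_AUDIO), rng.normal(size=D_TEXT))
             for _ in range(N_RESPONSES)]
scores = assess(responses)
print(scores.shape)   # (5,)
```

The key structural point the sketch captures is that fusion happens once per response in a shared latent space, while the per-response heads stay independent until the final mean pooling.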
Problem

Research questions and friction points this paper is trying to address.

Develops a multimodal framework for interview assessment
Integrates video, audio, and text for holistic evaluation
Ensures fair and unbiased performance scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal feature fusion via Shared Compression MLP
Two-level ensemble learning for robust predictions
365 aspects assessment integrating video, audio, text
Jia Li
Hefei University of Technology, Hefei, China
Yang Wang
Hefei University of Technology, Hefei, China
Wenhao Qian
Hefei University of Technology, Hefei, China
Zhenzhen Hu
Hefei University of Technology
Multimedia
Richang Hong
Hefei University of Technology
Multimedia, Pattern Recognition
Meng Wang
Hefei University of Technology, Hefei, China