🤖 AI Summary
This work addresses the practical challenge of multimodal learning when certain modalities are inherently missing during training, breaking away from the prevailing reliance on fully observed multimodal data. It presents the first approach capable of effective multimodal learning without any complete training samples. By reformulating the task as a conditional sequence inference problem, the method leverages a large language model (LLM) to guide context-aware modality imputation and multidimensional representation fusion through prompting. Furthermore, it introduces a mask-aware dual-path aggregation mechanism to dynamically calibrate inference uncertainty. Extensive experiments on three action quality assessment benchmarks demonstrate substantial performance gains over state-of-the-art models, validating the efficacy and feasibility of this novel data-efficient paradigm.
📝 Abstract
Real-world multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a ``God's eye view'' of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code is available at https://github.com/XuHuangbiao/LIMSSR.