LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations

📅 2026-05-01
📈 Citations: 0
Influential: 0
📄 PDF

career value

184K/year
🤖 AI Summary
This work addresses the practical challenge of multimodal learning when certain modalities are inherently missing during training, breaking away from the prevailing reliance on fully observed multimodal data. It presents the first approach capable of effective multimodal learning without any complete training samples. By reformulating the task as a conditional sequence inference problem, the method leverages a large language model (LLM) to guide context-aware modality imputation and multidimensional representation fusion through prompting. Furthermore, it introduces a mask-aware dual-path aggregation mechanism to dynamically calibrate inference uncertainty. Extensive experiments on three action quality assessment benchmarks demonstrate substantial performance gains over state-of-the-art models, validating the efficacy and feasibility of this novel data-efficient paradigm.
📝 Abstract
Real-world multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a ``God's eye view'' of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code is available at https://github.com/XuHuangbiao/LIMSSR.
Problem

Research questions and friction points this paper is trying to address.

Incomplete Multimodal Learning
Training-Time Incomplete Observations
Multimodal Sequence-to-Score Reasoning
Modality Imputation
Data-Efficient Multimodal Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incomplete Multimodal Learning
Large Language Model
Sequence-to-Score Reasoning
Modality Imputation
Mask-Aware Aggregation
H
Huangbiao Xu
Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China; Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350108, China
H
Huanqi Wu
Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China; Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350108, China
X
Xiao Ke
Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China; Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350108, China
Yuxin Peng
Yuxin Peng
Peking University
Cross-media Analysis and ReasoningImage & Video Understanding and RetrievalMachine Learning and Artificial Intelligence