Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the dual challenges of multimodal domain shift and modality collaboration in few-shot video domain adaptation (FSVDA). To tackle these, we propose a low-rank decomposition-based modality disentanglement framework. Our method introduces modality-collaborative low-rank decomposers and a multimodal decomposition router to explicitly decouple each modality's features into shared and modality-specific components. To achieve fine-grained cross-domain alignment, we design a cross-domain activation consistency loss; additionally, orthogonal decorrelation constraints and parameter sharing are incorporated to enhance generalization. Evaluated on three standard benchmarks, our approach significantly outperforms existing methods, demonstrating superior robustness and generalization in aligning multimodal features even under extremely limited target-domain labels. The results validate that our framework effectively mitigates modality-specific shifts while fostering synergistic cross-modal learning in the FSVDA setting.

📝 Abstract
In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which has been overlooked in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. This is because each modality comprises coupled features with multiple components that exhibit different levels of domain shift. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) that decomposes, from each modality, modality-unique and modality-shared features with different domain-shift levels, which are more amenable to domain alignment. MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). The decomposers have progressively shared parameters across modalities, and the MDR selectively activates them to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to the decomposers and sub-routers, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss that encourages target and source samples of the same category to exhibit consistent activation preferences over the decomposers, thereby facilitating domain alignment. Extensive experiments on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.
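The abstract describes a bank of low-rank decomposers whose subspaces are kept diverse by an orthogonal decorrelation constraint. The paper does not give implementation details here, so the following is only a minimal numpy sketch under assumed shapes: each decomposer is a rank-r factorization W_k = U_k V_k applied to a d-dimensional feature, and the decorrelation term penalizes overlap between the column spaces of different decomposers. All names and dimensions (`d`, `r`, `n_dec`, `decompose`, `orthogonal_decorrelation`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_dec = 64, 8, 4  # hypothetical: feature dim, bottleneck rank, number of decomposers

# Each decomposer projects a d-dim feature through a rank-r bottleneck: W_k = U_k @ V_k.
U = rng.standard_normal((n_dec, d, r)) * 0.1
V = rng.standard_normal((n_dec, r, d)) * 0.1

def decompose(x):
    """Apply every low-rank decomposer to feature x -> (n_dec, d) components."""
    return np.stack([U[k] @ (V[k] @ x) for k in range(n_dec)])

def orthogonal_decorrelation(U):
    """Penalize subspace overlap between decomposers: sum over i != j of
    the squared Frobenius norm of the cross-Gram matrix U_i^T U_j."""
    loss = 0.0
    for i in range(n_dec):
        for j in range(n_dec):
            if i != j:
                loss += np.sum((U[i].T @ U[j]) ** 2)
    return loss

x = rng.standard_normal(d)
parts = decompose(x)
print(parts.shape)  # (4, 64)
```

Driving `orthogonal_decorrelation` toward zero during training would push the decomposers toward mutually orthogonal subspaces, which is one plausible reading of how the constraint enforces diversity among the decomposed components.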
Problem

Research questions and friction points this paper is trying to address.

Addresses few-shot video domain adaptation with multimodal domain shifts
Decomposes multimodal features into domain-alignment-friendly components
Enhances cross-domain consistency through collaborative modality decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank decomposition for multimodal feature separation
Shared parameter decomposers with selective activation routing
Cross-domain consistency loss for improved domain alignment
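The routing and consistency ideas listed above can be sketched in numpy. The paper does not specify the router architecture or the exact form of the loss, so this is only one possible reading: a linear sub-router produces softmax activation weights over the decomposers, and the consistency loss is a symmetric KL divergence between the mean activation distributions of same-class source and target samples. `W_router`, `route`, and `activation_consistency` are hypothetical names, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_dec = 64, 4                                  # hypothetical dimensions
W_router = rng.standard_normal((n_dec, d)) * 0.1  # hypothetical linear sub-router

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(x):
    """Activation weights over the decomposers for one sample."""
    return softmax(W_router @ x)

def activation_consistency(src_feats, tgt_feats):
    """Symmetric KL between the mean decomposer-activation distributions of
    same-class source and target samples (one reading of the loss)."""
    p = np.mean([route(x) for x in src_feats], axis=0)
    q = np.mean([route(x) for x in tgt_feats], axis=0)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

src = rng.standard_normal((16, d))                 # source samples of one class
tgt = src[:4] + 0.3 * rng.standard_normal((4, d))  # few shifted target samples
loss = activation_consistency(src, tgt)
print(loss >= 0.0)  # True
```

Minimizing such a loss would encourage source and target samples of the same category to activate the same decomposers, which matches the stated goal of consistent activation preferences across domains.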
Yuyang Wanyan
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Xiaoshan Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and PengCheng Laboratory, Shenzhen 518066, China
Weiming Dong
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia, Computer vision