Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the dual challenges of multimodal domain shift and modality collaboration in few-shot video domain adaptation (FSVDA). To tackle these, we propose a low-rank decomposition-based modality disentanglement framework. Our method introduces modality-collaborative low-rank decomposers and a multimodal decomposition router to explicitly decouple each modality's features into shared and modality-specific components. To achieve fine-grained cross-domain alignment, we design a cross-domain activation consistency loss; additionally, orthogonal decorrelation constraints and parameter sharing are incorporated to enhance generalization. Evaluated on three standard benchmarks, our approach significantly outperforms existing methods, demonstrating superior robustness and generalization in aligning multimodal features even under extremely limited target-domain labels. The results validate that our framework effectively mitigates modality-specific shifts while fostering synergistic cross-modal learning in the FSVDA setting.

📝 Abstract
In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which has been overlooked in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. This is because each modality comprises coupled features with multiple components that exhibit different levels of domain shift. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) that decomposes, from each modality, modality-unique and modality-shared features with different domain-shift levels, which are more amenable to domain alignment. MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). The decomposers have progressively shared parameters across modalities, and the MDR selectively activates them to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to the decomposers and sub-routers, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss that encourages target and source samples of the same category to exhibit consistent activation preferences over the decomposers, thereby facilitating domain alignment. Extensive experiments on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.
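The abstract describes a bank of low-rank decomposers whose subspaces are kept diverse by an orthogonal decorrelation constraint. The paper does not give implementation details here, so the following is only a minimal numpy sketch under assumed shapes: each decomposer is a rank-r factorization W_k = U_k V_k applied to a d-dimensional feature, and the decorrelation term penalizes overlap between the column spaces of different decomposers. All names and dimensions (`d`, `r`, `n_dec`, `decompose`, `orthogonal_decorrelation`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_dec = 64, 8, 4  # hypothetical: feature dim, bottleneck rank, number of decomposers

# Each decomposer projects a d-dim feature through a rank-r bottleneck: W_k = U_k @ V_k.
U = rng.standard_normal((n_dec, d, r)) * 0.1
V = rng.standard_normal((n_dec, r, d)) * 0.1

def decompose(x):
    """Apply every low-rank decomposer to feature x -> (n_dec, d) components."""
    return np.stack([U[k] @ (V[k] @ x) for k in range(n_dec)])

def orthogonal_decorrelation(U):
    """Penalize subspace overlap between decomposers: sum over i != j of
    the squared Frobenius norm of the cross-Gram matrix U_i^T U_j."""
    loss = 0.0
    for i in range(n_dec):
        for j in range(n_dec):
            if i != j:
                loss += np.sum((U[i].T @ U[j]) ** 2)
    return loss

x = rng.standard_normal(d)
parts = decompose(x)
print(parts.shape)  # (4, 64)
```

Driving `orthogonal_decorrelation` toward zero during training would push the decomposers toward mutually orthogonal subspaces, which is one plausible reading of how the constraint enforces diversity among the decomposed components.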
Problem

Research questions and friction points this paper is trying to address.

Addresses few-shot video domain adaptation with multimodal domain shifts
Decomposes multimodal features into domain-alignment-friendly components
Enhances cross-domain consistency through collaborative modality decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank decomposition for multimodal feature separation
Shared parameter decomposers with selective activation routing
Cross-domain consistency loss for improved domain alignment
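The routing and consistency ideas listed above can be sketched in numpy. The paper does not specify the router architecture or the exact form of the loss, so this is only one possible reading: a linear sub-router produces softmax activation weights over the decomposers, and the consistency loss is a symmetric KL divergence between the mean activation distributions of same-class source and target samples. `W_router`, `route`, and `activation_consistency` are hypothetical names, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_dec = 64, 4                                  # hypothetical dimensions
W_router = rng.standard_normal((n_dec, d)) * 0.1  # hypothetical linear sub-router

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(x):
    """Activation weights over the decomposers for one sample."""
    return softmax(W_router @ x)

def activation_consistency(src_feats, tgt_feats):
    """Symmetric KL between the mean decomposer-activation distributions of
    same-class source and target samples (one reading of the loss)."""
    p = np.mean([route(x) for x in src_feats], axis=0)
    q = np.mean([route(x) for x in tgt_feats], axis=0)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

src = rng.standard_normal((16, d))                 # source samples of one class
tgt = src[:4] + 0.3 * rng.standard_normal((4, d))  # few shifted target samples
loss = activation_consistency(src, tgt)
print(loss >= 0.0)  # True
```

Minimizing such a loss would encourage source and target samples of the same category to activate the same decomposers, which matches the stated goal of consistent activation preferences across domains.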
Yuyang Wanyan
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Xiaoshan Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and PengCheng Laboratory, Shenzhen 518066, China
Weiming Dong
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia, Computer vision